Describe the bug
Hi,
I wanted to play around with the initial "LLM Pre-training Single Node" example.
I am using a recent commit of AutoModel (6c38eae).
Steps/Code to reproduce bug
I ran it with the following command:
```bash
uv run torchrun --nproc-per-node=1 examples/llm_pretrain/pretrain.py -c examples/llm_pretrain/nanogpt_pretrain.yaml
```

Unfortunately, I am getting the following error:
```
2025-12-09 23:16:24 | INFO | nemo_automodel.components.distributed.fsdp2 | World size is 1, skipping parallelization.
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/stefan/Repositories/Automodel/examples/llm_pretrain/pretrain.py", line 37, in <module>
[rank0]: main()
[rank0]: File "/home/stefan/Repositories/Automodel/examples/llm_pretrain/pretrain.py", line 32, in main
[rank0]: recipe.setup()
[rank0]: File "/home/stefan/Repositories/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 984, in setup
[rank0]: model, model_state_dict_keys, self.optimizer, self.loss_fn, self.param_info = build_model_and_optimizer(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/stefan/Repositories/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 313, in build_model_and_optimizer
[rank0]: checkpointer.load_base_model(
[rank0]: File "/home/stefan/Repositories/Automodel/nemo_automodel/components/checkpoint/checkpointing.py", line 368, in load_base_model
[rank0]: assert model_name is not None, "model_name is required when loading base model"
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError: model_name is required when loading base model
```
Expected behavior
Training should not fail, and a base model should not be needed, since the example demonstrates pretraining from scratch.
Additional context
I debugged that case a bit, and found out the following:
Automodel/nemo_automodel/recipes/llm/train_ft.py, lines 984 to 1003 in 6c38eae:

```python
model, model_state_dict_keys, self.optimizer, self.loss_fn, self.param_info = build_model_and_optimizer(
    self.dist_env.device,
    self.cfg.model,
    self.cfg.optimizer,
    self.peft_config,
    self.model_wrapper,
    has_packed_sequence=self.cfg.get("packed_sequence.packed_sequence_size", 0) > 0,
    seed=self.cfg.get("seed", 42),
    tp_size=self.cfg.get("distributed.tp_size", 1),
    cp_size=self.cfg.get("distributed.cp_size", 1),
    cfg_fp8=self.cfg.get("fp8", None),
    cfg_compile=self.cfg.get("compile", None),
    cfg_quantization=self.cfg.get("quantization", None),
    cfg_qat=self.cfg.get("qat", None),
    autopipeline=autopipeline,
    loss_fn=self.loss_fn,
    parallelize_fn=parallelize_fn,
    load_base_model=self.cfg.get("checkpoint.load_base_model", True),
    checkpointer=self.checkpointer,
)
```
The interesting line is:

```python
load_base_model=self.cfg.get("checkpoint.load_base_model", True),
```

So when the `checkpoint.load_base_model` configuration key is not set, it falls back to `True`. I then checked the configuration file located under `examples/llm_pretrain/nanogpt_pretrain.yaml`, and indeed, there is no such configuration option.
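Just to illustrate the mechanism (a minimal sketch with a plain dict standing in for the real config object, so the names below are for illustration only):

```python
# Stand-in for the parsed nanogpt_pretrain.yaml: there is no "checkpoint"
# section, so the dotted key is simply absent.
cfg = {"seed": 42}

# Same pattern as the quoted call site: a missing key yields the default,
# so a from-scratch pretraining run still tries to load a base model and
# later trips the "model_name is required" assertion.
load_base_model = cfg.get("checkpoint.load_base_model", True)
print(load_base_model)  # True
```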
I manually added:

```yaml
checkpoint:
  load_base_model: false
```

and the bug was resolved.
So maybe the configuration file of this example should be extended, or the default value should be set to `False` in `nemo_automodel/recipes/llm/train_ft.py` (though this could have unwanted side effects...).
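For the second option, only the default of that one lookup would need to change. Here is a rough, untested sketch of the intended behavior, using a hypothetical helper name rather than the actual call site:

```python
# Hypothetical helper mirroring the call site in train_ft.py, with the
# fallback flipped from True to False (sketch only, not a tested patch).
def resolve_load_base_model(cfg: dict) -> bool:
    return cfg.get("checkpoint.load_base_model", False)

print(resolve_load_base_model({}))                                    # False: pretraining from scratch works
print(resolve_load_base_model({"checkpoint.load_base_model": True}))  # True: recipes that need a base model opt in
```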