
LLM Pre-training example: model_name is required when loading base model #950

@stefan-it

Description


Describe the bug

Hi,

I wanted to play around with the initial "LLM Pre-training Single Node" example.

I am using the recent 6c38eae commit of AutoModel.

Steps/Code to reproduce bug

I ran it with the following command:

uv run torchrun --nproc-per-node=1 \
  examples/llm_pretrain/pretrain.py \
  -c examples/llm_pretrain/nanogpt_pretrain.yaml

Unfortunately, I am getting the following error:

2025-12-09 23:16:24 | INFO | nemo_automodel.components.distributed.fsdp2 | World size is 1, skipping parallelization.                                                                                                
[rank0]: Traceback (most recent call last):                                                                                                                                                                          
[rank0]:   File "/home/stefan/Repositories/Automodel/examples/llm_pretrain/pretrain.py", line 37, in <module>                                                                                                        
[rank0]:     main()                                                                                                                                                                                                  
[rank0]:   File "/home/stefan/Repositories/Automodel/examples/llm_pretrain/pretrain.py", line 32, in main                                                                                                            
[rank0]:     recipe.setup()                                                                                                                                                                                          
[rank0]:   File "/home/stefan/Repositories/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 984, in setup                                                                                                     
[rank0]:     model, model_state_dict_keys, self.optimizer, self.loss_fn, self.param_info = build_model_and_optimizer(                                                                                                
[rank0]:                                                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                
[rank0]:   File "/home/stefan/Repositories/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 313, in build_model_and_optimizer                                                                                 
[rank0]:     checkpointer.load_base_model(                                                                                                                                                                           
[rank0]:   File "/home/stefan/Repositories/Automodel/nemo_automodel/components/checkpoint/checkpointing.py", line 368, in load_base_model                                                                            
[rank0]:     assert model_name is not None, "model_name is required when loading base model"                                                                                                                         
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                           
[rank0]: AssertionError: model_name is required when loading base model 

Expected behavior

Training should not fail, and a base model should not be needed, since the example demonstrates pretraining from scratch.

Additional context

I debugged this a bit and found the following call in nemo_automodel/recipes/llm/train_ft.py:

model, model_state_dict_keys, self.optimizer, self.loss_fn, self.param_info = build_model_and_optimizer(
    self.dist_env.device,
    self.cfg.model,
    self.cfg.optimizer,
    self.peft_config,
    self.model_wrapper,
    has_packed_sequence=self.cfg.get("packed_sequence.packed_sequence_size", 0) > 0,
    seed=self.cfg.get("seed", 42),
    tp_size=self.cfg.get("distributed.tp_size", 1),
    cp_size=self.cfg.get("distributed.cp_size", 1),
    cfg_fp8=self.cfg.get("fp8", None),
    cfg_compile=self.cfg.get("compile", None),
    cfg_quantization=self.cfg.get("quantization", None),
    cfg_qat=self.cfg.get("qat", None),
    autopipeline=autopipeline,
    loss_fn=self.loss_fn,
    parallelize_fn=parallelize_fn,
    load_base_model=self.cfg.get("checkpoint.load_base_model", True),
    checkpointer=self.checkpointer,
)

The interesting line is:

load_base_model=self.cfg.get("checkpoint.load_base_model", True),

So when the checkpoint.load_base_model configuration key is not set, it falls back to True. I then checked the configuration file under examples/llm_pretrain/nanogpt_pretrain.yaml, and indeed there is no such option.
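
For context, the lookup presumably treats the dotted key as a nested path and returns the default when any segment is missing; a minimal sketch of that fallback behavior (hypothetical helper, not the actual AutoModel config implementation):

# Minimal sketch of a dotted-key lookup with a default, to illustrate why the
# missing "checkpoint" section falls back to True. Hypothetical helper; not
# the actual AutoModel config code.
def cfg_get(cfg: dict, dotted_key: str, default):
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default  # any missing segment along the path -> default
        node = node[part]
    return node

# nanogpt_pretrain.yaml has no "checkpoint" section, so the lookup returns True
# and the recipe tries to load a base model, which then requires model_name.
cfg = {"model": {"name": "nanogpt"}}  # illustrative contents only
assert cfg_get(cfg, "checkpoint.load_base_model", True) is True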

I manually added:

checkpoint:
  load_base_model: false

With that change, the error is resolved.

So maybe the configuration file of this example should be extended, or the default value should be set to False in nemo_automodel/recipes/llm/train_ft.py (although that could have unwanted side effects...).
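
For illustration, here is the difference between the two options in plain Python (hypothetical names, not the actual recipe code):

# Illustration of the two proposed fixes; hypothetical names, not actual
# AutoModel code.
def resolve_load_base_model(cfg: dict, default: bool = True) -> bool:
    # Option 1: the pretraining example YAML sets the key explicitly to False.
    # Option 2: the recipe changes `default` to False when the key is absent,
    #           which would also affect configs that currently rely on the
    #           implicit True.
    return cfg.get("checkpoint", {}).get("load_base_model", default)

pretrain_cfg = {"checkpoint": {"load_base_model": False}}   # option 1
finetune_cfg = {}  # no checkpoint section, relies on the default today
print(resolve_load_base_model(pretrain_cfg))                 # False
print(resolve_load_base_model(finetune_cfg))                 # True (current behavior)
print(resolve_load_base_model(finetune_cfg, default=False))  # False (option 2)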
