Describe the bug
Hi,
I wanted to play around with the initial "LLM Pre-training Single Node" example.
I am using a recent commit of AutoModel (6c38eae).
Steps/Code to reproduce bug
I ran it with the following command:
```bash
uv run torchrun --nproc-per-node=1 examples/llm_pretrain/pretrain.py -c examples/llm_pretrain/nanogpt_pretrain.yaml
```

Unfortunately, I am getting the following error:
```
2025-12-09 23:16:24 | INFO | nemo_automodel.components.distributed.fsdp2 | World size is 1, skipping parallelization.
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/stefan/Repositories/Automodel/examples/llm_pretrain/pretrain.py", line 37, in <module>
[rank0]: main()
[rank0]: File "/home/stefan/Repositories/Automodel/examples/llm_pretrain/pretrain.py", line 32, in main
[rank0]: recipe.setup()
[rank0]: File "/home/stefan/Repositories/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 984, in setup
[rank0]: model, model_state_dict_keys, self.optimizer, self.loss_fn, self.param_info = build_model_and_optimizer(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/stefan/Repositories/Automodel/nemo_automodel/recipes/llm/train_ft.py", line 313, in build_model_and_optimizer
[rank0]: checkpointer.load_base_model(
[rank0]: File "/home/stefan/Repositories/Automodel/nemo_automodel/components/checkpoint/checkpointing.py", line 368, in load_base_model
[rank0]: assert model_name is not None, "model_name is required when loading base model"
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError: model_name is required when loading base model
```
Expected behavior
Training should not fail, and a base model should not be needed, since the example demonstrates pretraining from scratch.
Additional context
I debugged that case a bit, and found out the following:
Automodel/nemo_automodel/recipes/llm/train_ft.py, lines 984 to 1003 in 6c38eae:

```python
model, model_state_dict_keys, self.optimizer, self.loss_fn, self.param_info = build_model_and_optimizer(
    self.dist_env.device,
    self.cfg.model,
    self.cfg.optimizer,
    self.peft_config,
    self.model_wrapper,
    has_packed_sequence=self.cfg.get("packed_sequence.packed_sequence_size", 0) > 0,
    seed=self.cfg.get("seed", 42),
    tp_size=self.cfg.get("distributed.tp_size", 1),
    cp_size=self.cfg.get("distributed.cp_size", 1),
    cfg_fp8=self.cfg.get("fp8", None),
    cfg_compile=self.cfg.get("compile", None),
    cfg_quantization=self.cfg.get("quantization", None),
    cfg_qat=self.cfg.get("qat", None),
    autopipeline=autopipeline,
    loss_fn=self.loss_fn,
    parallelize_fn=parallelize_fn,
    load_base_model=self.cfg.get("checkpoint.load_base_model", True),
    checkpointer=self.checkpointer,
)
```
The interesting line is:

```python
load_base_model=self.cfg.get("checkpoint.load_base_model", True),
```

So when the `checkpoint.load_base_model` configuration key is not set, it falls back to `True`. I then checked the configuration file located under `examples/llm_pretrain/nanogpt_pretrain.yaml`, and indeed, there is no such configuration option.
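Just to illustrate the mechanism (a minimal sketch with a plain dict standing in for the real config object, so the names below are for illustration only):

```python
# Stand-in for the parsed nanogpt_pretrain.yaml: there is no "checkpoint"
# section, so the dotted key is simply absent.
cfg = {"seed": 42}

# Same pattern as the quoted call site: a missing key yields the default,
# so a from-scratch pretraining run still tries to load a base model and
# later trips the "model_name is required" assertion.
load_base_model = cfg.get("checkpoint.load_base_model", True)
print(load_base_model)  # True
```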
I manually added:

```yaml
checkpoint:
  load_base_model: false
```

and the bug was resolved.
So maybe the configuration file of this example should be extended, or the default value should be set to `False` in `nemo_automodel/recipes/llm/train_ft.py` (though this could have unwanted side effects...).
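For the second option, only the default of that one lookup would need to change. Here is a rough, untested sketch of the intended behavior, using a hypothetical helper name rather than the actual call site:

```python
# Hypothetical helper mirroring the call site in train_ft.py, with the
# fallback flipped from True to False (sketch only, not a tested patch).
def resolve_load_base_model(cfg: dict) -> bool:
    return cfg.get("checkpoint.load_base_model", False)

print(resolve_load_base_model({}))                                    # False: pretraining from scratch works
print(resolve_load_base_model({"checkpoint.load_base_model": True}))  # True: recipes that need a base model opt in
```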