Describe the bug
- I used the config `examples/llm_finetune/qwen/qwen3_moe_30b_te_deepep.yaml` (see the YAML sketch after this list).
- I set `checkpoint.enabled` to `True`, changed `checkpoint.checkpoint_dir` to my desired path, and started training with `automodel finetune llm -c qwen3_moe_30b_te_deepep.yaml`.
- Training works as expected and the loss decreases smoothly.
- I then used vLLM to generate outputs from the trained checkpoint and found them to be complete junk. Truncated example output:

```
 \r\n and and and and and and.\n and and and and and and and and and and and and \n and and and and and and and and and plant, to \n on and with and and and and this and and and and and and and in and and and and,,and and and \n and and and \r\n and \r\n\r\n and at and plant and.\r\n and and and and \ufffd and \n
```

- The model fails to follow any instruction, and the output contains invalid Unicode characters. It seems the model has lost its vocabulary and produces nonsensical tokens.
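For context, the checkpoint settings I changed look roughly like this (a sketch only; the exact key nesting in `qwen3_moe_30b_te_deepep.yaml` may differ, and the directory is an example path):

```yaml
# Sketch of the relevant section of qwen3_moe_30b_te_deepep.yaml.
# Key nesting is assumed from the option names above; the path is an example.
checkpoint:
  enabled: true
  checkpoint_dir: /workspace/ckpts
```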
Steps/Code to reproduce bug
- Enable checkpointing in `examples/llm_finetune/qwen/qwen3_moe_30b_te_deepep.yaml` and change `checkpoint.checkpoint_dir` as appropriate for your setup.
- Run `automodel finetune llm -c qwen3_moe_30b_te_deepep.yaml`.
- Pass the consolidated checkpoint path to a vLLM chat function. Below is the code snippet:
```python
from vllm import LLM, SamplingParams

if __name__ == '__main__':
    # Your consolidated checkpoint path, e.g. `/workspace/ckpts/epoch_1_step_47/model/consolidated`
    load_path = ""
    model = LLM(model=load_path, dtype="bfloat16", tensor_parallel_size=8)
    params = SamplingParams(max_tokens=1024, temperature=1.0, top_k=100)
    test_queries = [
        "Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles. then",
    ]
    test_prompts = [
        [{'role': 'user', 'content': p}]
        for p in test_queries
    ]
    response = model.chat(test_prompts, params)
    print(response[0].outputs[0].text)
```
Setup:
- nemo-automodel version: 0.2.0
- vLLM version: 0.10.2
- GPU: 8× NVIDIA H100 80GB HBM3
- Inference dtype: bf16
- CUDA Version: 12.2
Expected behavior:
The current output (shown above) is random junk; I expect the fine-tuned checkpoint to produce coherent, sensible tokens.
Additional context:
A)
I observed that the consolidated checkpoint is much smaller than the original model's checkpoint.
Original model from HF: https://huggingface.co/Qwen/Qwen3-30B-A3B
Checkpoint size of the original model on HF: 61.1 GB
Size of the consolidated training checkpoint created by the nemo-automodel trainer: ~10 GB
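As a rough sanity check (my own back-of-the-envelope arithmetic): Qwen3-30B-A3B has about 30.5B parameters, and at 2 bytes per bf16 parameter that is roughly 61 GB, which matches the HF size; ~10 GB corresponds to only about 5B parameters, which suggests most of the weights never made it into the consolidated checkpoint. A quick way to measure the on-disk weight size (a sketch; the path is an example):

```python
# Sketch: sum the on-disk size of all .safetensors shards under a
# checkpoint directory, in GB. The path below is an example.
from pathlib import Path

def weights_size_gb(ckpt_dir: str) -> float:
    return sum(f.stat().st_size for f in Path(ckpt_dir).rglob("*.safetensors")) / 1e9

print(weights_size_gb("/workspace/ckpts/epoch_1_step_47/model/consolidated"))
```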
B)
As suggested in docs/guides/checkpointing.md, I also tried loading the consolidated checkpoint with Hugging Face Transformers and generating outputs; they are still junk.
C)
While loading the consolidated checkpoint with Hugging Face Transformers, I got a warning saying that many of the model's weights were not initialized from the checkpoint and were newly initialized.
I used this code for B) and C), replacing model_path with the path to the consolidated checkpoint and changing the prompt for my use case.
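For reference, a minimal sketch of HF loading and generation along those lines, assuming standard Transformers APIs (`model_path` and the prompt are placeholders):

```python
# Minimal sketch, assuming standard Hugging Face Transformers APIs.
# model_path and the prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/workspace/ckpts/epoch_1_step_47/model/consolidated"  # example path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Give me a short introduction to large language models.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```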