ChatML template bug in test chat with unsloth #10247

@mykeehu

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

0.9.4, Windows, Python 3.12.10

Reproduction

After training a LoRA adapter with Unsloth on a ChatML-template model (e.g., Magnum v2 4B or Hermes 8B), I am unable to load the model for a test chat with Unsloth.

[WARNING|2026-03-04 21:16:43] llamafactory.extras.ploting:149 >> No metric eval_loss to plot.
[WARNING|2026-03-04 21:16:43] llamafactory.extras.ploting:149 >> No metric eval_accuracy to plot.
[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:00,548 >> loading file tokenizer.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\tokenizer.json
[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:00,549 >> loading file tokenizer.model from cache at None
[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:00,549 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:00,549 >> loading file special_tokens_map.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\special_tokens_map.json
[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:00,549 >> loading file tokenizer_config.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\tokenizer_config.json
[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:00,549 >> loading file chat_template.jinja from cache at None
[INFO|tokenization_utils_base.py:2380] 2026-03-04 21:17:00,753 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:765] 2026-03-04 21:17:01,884 >> loading configuration file config.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\config.json
[INFO|configuration_utils.py:839] 2026-03-04 21:17:01,884 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 128019,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 9216,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.6",
  "use_cache": false,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:02,334 >> loading file tokenizer.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\tokenizer.json
[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:02,334 >> loading file tokenizer.model from cache at None
[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:02,334 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:02,334 >> loading file special_tokens_map.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\special_tokens_map.json
[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:02,334 >> loading file tokenizer_config.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\tokenizer_config.json
[INFO|tokenization_utils_base.py:2111] 2026-03-04 21:17:02,334 >> loading file chat_template.jinja from cache at None
[INFO|tokenization_utils_base.py:2380] 2026-03-04 21:17:02,518 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2026-03-04 21:17:02] llamafactory.data.template:144 >> Add <|im_start|> to stop words.
[INFO|configuration_utils.py:765] 2026-03-04 21:17:02,851 >> loading configuration file config.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\config.json
[INFO|configuration_utils.py:839] 2026-03-04 21:17:02,851 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 128019,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 9216,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.6",
  "use_cache": false,
  "vocab_size": 128256
}

[WARNING|logging.py:328] 2026-03-04 21:17:02,851 >> `torch_dtype` is deprecated! Use `dtype` instead!
[INFO|2026-03-04 21:17:02] llamafactory.model.model_utils.kv_cache:144 >> KV cache is enabled for faster generation.
E:\LlamaFactory\src\llamafactory\model\model_utils\unsloth.py:89: UserWarning: WARNING: Unsloth should be imported before [trl, transformers, peft] to ensure all optimizations are applied. Your code may run slower or encounter memory issues without these optimizations.

Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel  # type: ignore
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Your Flash Attention 2 installation seems to be broken. Using Xformers instead. No performance changes will be seen.
🦥 Unsloth Zoo will now patch everything to make training faster!
[INFO|configuration_utils.py:765] 2026-03-04 21:17:06,654 >> loading configuration file config.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\config.json
[INFO|configuration_utils.py:839] 2026-03-04 21:17:06,670 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 128019,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 9216,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.6",
  "use_cache": false,
  "vocab_size": 128256
}

Unsloth: WARNING `trust_remote_code` is True.
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2026.3.3: Fast Llama patching. Transformers: 4.57.6.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 24.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.10.0+cu130. CUDA: 8.6. CUDA Toolkit: 13.0. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
[INFO|configuration_utils.py:765] 2026-03-04 21:17:12,598 >> loading configuration file config.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\config.json
[INFO|configuration_utils.py:839] 2026-03-04 21:17:12,598 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 128019,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 9216,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.6",
  "use_cache": false,
  "vocab_size": 128256
}

[INFO|configuration_utils.py:765] 2026-03-04 21:17:12,809 >> loading configuration file config.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\config.json
[INFO|configuration_utils.py:839] 2026-03-04 21:17:12,809 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 128019,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 9216,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.6",
  "use_cache": false,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:1172] 2026-03-04 21:17:12,809 >> loading weights file model.safetensors from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\model.safetensors.index.json
[INFO|modeling_utils.py:2341] 2026-03-04 21:17:12,809 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:986] 2026-03-04 21:17:12,809 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128019,
  "use_cache": false
}

Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:02<00:00,  1.35s/it]
[INFO|configuration_utils.py:941] 2026-03-04 21:17:15,728 >> loading configuration file generation_config.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--anthracite-org--magnum-v2-4b\snapshots\31a45f774c4db8005f645c7fbd1345ad47b45ceb\generation_config.json
[INFO|configuration_utils.py:986] 2026-03-04 21:17:15,728 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128001
}

[INFO|dynamic_module_utils.py:423] 2026-03-04 21:17:15,878 >> Could not locate the custom_generate/generate.py inside anthracite-org/magnum-v2-4b.
Traceback (most recent call last):
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\queueing.py", line 849, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\route_utils.py", line 354, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\blocks.py", line 2191, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\blocks.py", line 1710, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\utils.py", line 760, in async_iteration
    return await anext(iterator)
           ^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\utils.py", line 751, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\anyio\to_thread.py", line 63, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\anyio\_backends\_asyncio.py", line 2502, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\anyio\_backends\_asyncio.py", line 986, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\utils.py", line 734, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\utils.py", line 898, in gen_wrapper
    response = next(iterator)
               ^^^^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\webui\chatter.py", line 158, in load_model
    super().__init__(args)
  File "E:\LlamaFactory\src\llamafactory\chat\chat_model.py", line 53, in __init__
    self.engine: BaseEngine = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\chat\hf_engine.py", line 59, in __init__
    self.model = load_model(
                 ^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\model\loader.py", line 189, in load_model
    model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\model\adapter.py", line 360, in init_adapter
    model = _setup_lora_tuning(
            ^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\model\adapter.py", line 208, in _setup_lora_tuning
    model = load_unsloth_peft_model(config, model_args, finetuning_args, is_trainable=is_trainable)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\model\model_utils\unsloth.py", line 96, in load_unsloth_peft_model
    model, _ = FastLanguageModel.from_pretrained(**unsloth_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\unsloth\models\loader.py", line 704, in from_pretrained
    model, tokenizer = dispatch_model.from_pretrained(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\unsloth\models\llama.py", line 2501, in from_pretrained
    tokenizer = load_correct_tokenizer(
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\unsloth\tokenizer_utils.py", line 622, in load_correct_tokenizer
    chat_template = fix_chat_template(tokenizer)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\unsloth\tokenizer_utils.py", line 734, in fix_chat_template
    raise RuntimeError(
RuntimeError: Unsloth: The tokenizer `saves\Llama-3.1-8B-Instruct\lora\train_2026-03-04-20-39-27-Magnum-4B-1`
does not have a {% if add_generation_prompt %} for generation purposes.
Please file a bug report to the maintainers of `saves\Llama-3.1-8B-Instruct\lora\train_2026-03-04-20-39-27-Magnum-4B-1` - thanks!
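For context, the RuntimeError above comes from Unsloth's `fix_chat_template` check: the tokenizer saved alongside the LoRA adapter has a `chat_template` with no `{% if add_generation_prompt %}` branch, which Unsloth requires for generation. A possible workaround (untested here; the helper name and paths are illustrative, not part of either project) is to write a standard ChatML template, including that branch, into the saved `tokenizer_config.json`:

```python
# Sketch of a workaround: patch the chat_template in a saved tokenizer_config.json
# so it contains the {% if add_generation_prompt %} branch Unsloth checks for.
# `patch_tokenizer_config` is a hypothetical helper, not part of LLaMA-Factory or Unsloth.
import json
import pathlib
import tempfile

# A standard ChatML Jinja template with the generation-prompt branch.
CHATML_TEMPLATE = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}"
    "{{ '<|im_start|>assistant\n' }}"
    "{% endif %}"
)

def patch_tokenizer_config(path: pathlib.Path) -> None:
    """Overwrite the chat_template field of a saved tokenizer_config.json."""
    cfg = json.loads(path.read_text(encoding="utf-8"))
    cfg["chat_template"] = CHATML_TEMPLATE
    path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")

# Demonstration against a throwaway config file (a real run would target the
# adapter directory, e.g. saves\...\tokenizer_config.json):
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = pathlib.Path(tmp) / "tokenizer_config.json"
    cfg_path.write_text(json.dumps({"chat_template": None}), encoding="utf-8")
    patch_tokenizer_config(cfg_path)
    patched = json.loads(cfg_path.read_text(encoding="utf-8"))
    print("add_generation_prompt" in patched["chat_template"])  # True
```

This only changes what the tokenizer serializes to disk; whether the template matches what the adapter was actually trained on still needs to be verified against the training configuration.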

I also reported this bug to Unsloth, along with several logs:
unslothai/unsloth#4150

Others

No response

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), pending (This problem is yet to be addressed)
