Unable to get good results with knowledge distillation after the update of online distillation #2957
GrassHeadd
started this conversation in
General
Replies: 1 comment 1 reply
-
Hey, could you clarify what you mean by "I ran into many issues regarding the compatibility of the example dataset and also the file"? What was step 1 for? Was there a specific error that led to that change?
Thanks for the feedback. Did you mean this is a new issue with online distillation only? If you tried offline distillation instead, how did it go?
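For reference, here is a minimal sketch of the two setups being discussed, written in plain PyTorch against Hugging Face-style causal LMs rather than axolotl's actual implementation; the temperature value and function names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    # Temperature-scaled KL divergence between teacher and student
    # token distributions (the classic Hinton-style distillation loss).
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

def online_step(student, teacher, batch, temperature=2.0):
    # Online distillation: the teacher runs a forward pass on every batch.
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    return kd_loss(student_logits, teacher_logits, temperature)

def offline_step(student, batch, teacher_logits, temperature=2.0):
    # Offline distillation: teacher logits/logprobs were precomputed and
    # stored alongside the dataset, so only the student runs during training.
    student_logits = student(**batch).logits
    return kd_loss(student_logits, teacher_logits, temperature)
```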
-
Issue
While trying to replicate the author's knowledge distillation setup, I ran into many issues regarding the compatibility of the example dataset and also the file. After fixing those issues I was able to train the model, yet I could not achieve satisfactory results; in fact, the results are very bad. The model just repeatedly outputs the last generated token like a broken record. E.g., when prompted with "how are you?", it just outputs "? ? ? ? ? ..."
Expected behaviour:
Able to train a model that works well
Actual behaviour:
The trained model only repeatedly outputs the last token
Steps to replicate
1. Made changes to the axolotl/src/axolotl/prompt_strategies/__init__.py file
2. Ran axolotl train kd_test_config.yml in the terminal
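A quick way to check for the repeated-token failure after training is a short generation test along the following lines; this is a sketch, and the checkpoint path is a placeholder rather than a path from the original report:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to the distilled checkpoint written by `axolotl train`
# (the actual output directory comes from kd_test_config.yml).
model_path = "./outputs/kd_test"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("how are you?", return_tensors="pt")
# Greedy decoding makes degenerate repetition easy to spot.
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# A collapsed model tends to emit the same token over and over, e.g. "? ? ? ? ..."
```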