-
Hi HF community. I'm fine-tuning Qwen 2.5 1.5B Instruct using accelerate and SFT, with a validation set whose eval loss is computed every so many steps. The completions are answer-only and never repeated, so memorization-driven over-training within the first epoch should be theoretically impossible if completion-only loss were working.

But that's not what I'm seeing during training. I'm seeing some seriously suspect behavior that suggests completion_only_loss may silently not support what I'm trying to do. For one, the training speed is no different from passing a vanilla conversational modeling dataset with just a 'messages' column. What's more damning: even though the Q&A pairs are shuffled, the loss on the training data collapses (<0.05) within the first 5% of the first epoch. The eval loss makes progress over that same first 5%, but then bounces hard by ~10% as the model over-trains on the training set and worsens on eval.

Given the completions are answer-only, there's not much to memorize other than the EOS tokens and role: assistant, which would be common in the eval set as well. Yet eval loss is way off, with the minimum around 0.25 before bouncing off and getting worse. Am I doing anything obviously wrong? Any insight is much appreciated.

reqs:
train loop:
logs:
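For context, completion-only loss is usually implemented by masking the prompt tokens in the labels with PyTorch's cross-entropy ignore index (-100), so only completion tokens contribute to the loss. One sanity check is to pull a batch from the trainer's data collator and confirm the prompt positions in `labels` are -100 while the answer positions are real token ids. A minimal sketch of the masking itself, with made-up token ids and a hypothetical helper:

```python
# Hypothetical illustration of completion-only masking: prompt tokens are
# replaced with the ignore index so cross-entropy only scores the answer.
# Token ids here are toy values, not from any real tokenizer.

IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss default ignore_index

def mask_prompt_tokens(input_ids, prompt_len):
    """Return labels where the prompt portion is ignored by the loss."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

# Example: 5 prompt tokens followed by a 3-token answer.
input_ids = [101, 2023, 2003, 1037, 3160, 42, 7, 102]
labels = mask_prompt_tokens(input_ids, prompt_len=5)
print(labels)  # [-100, -100, -100, -100, -100, 42, 7, 102]
```

If the labels coming out of your actual collator have no -100 spans over the prompt, the model is being trained on the full sequence, which would explain training loss collapsing on repeated question text.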
-
Thanks for reporting. Do you mind sharing a sample of your dataset so that we can try to reproduce?
-
I believe the issue is on my end. The training regime for this synthetic dataset works much more like I'd expect. I'll retract this discussion for now.
-
For anyone who finds their way here: it seems Liger might break completion-only loss, per this issue in TRL: #3484
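If that is the cause, a workaround sketch is to turn the Liger kernel off while keeping completion-only loss on. The parameter names below assume a recent TRL/transformers version and may differ in yours; this is a config fragment, not a full training script:

```python
from trl import SFTConfig

# Workaround sketch (assumed parameter names): keep completion-only loss
# but disable the Liger kernel, which the linked issue suggests may
# bypass the completion mask.
config = SFTConfig(
    output_dir="out",
    completion_only_loss=True,  # train on answer tokens only
    use_liger_kernel=False,     # avoid the suspected Liger interaction
)
```

A quick A/B of train/eval loss curves with and without `use_liger_kernel` should confirm whether Liger is the culprit in your setup.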