
Weird logits and model degeneration while training with DPO #77


Description

@DungNasSa10

Recently, I have been experimenting with DPO training for Vietnamese. I start from a strong SFT model, vinai/PhoGPT-4B-Chat, and follow the method described in Chen, Zixiang, et al., "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models" (arXiv:2401.01335, 2024) to build a preference dataset from my own SFT dataset. I use trl for training with the following config:

  • Deepspeed zero 3 offload
  • beta = 0.1
  • global_batch_size 128
  • learning_rate 1e-6
  • learning_rate_scheduler cosine
  • optim adam_torch
  • bf16
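For concreteness, the setup above could be wired up roughly as follows. This is a minimal, hypothetical sketch, assuming a recent trl version where `DPOConfig` carries the DPO-specific hyperparameters; the model name and hyperparameter values come from the list above, while the output directory, DeepSpeed config path, per-device batch size, and `preference_dataset` variable are placeholders.

```python
# Hypothetical reconstruction of the reported trl DPO setup -- a sketch,
# not the issue author's actual script.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "vinai/PhoGPT-4B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

args = DPOConfig(
    output_dir="phogpt-4b-chat-spin",       # placeholder path
    beta=0.1,                               # DPO temperature from the issue
    learning_rate=1e-6,
    lr_scheduler_type="cosine",
    bf16=True,
    per_device_train_batch_size=8,          # placeholder; together with
    gradient_accumulation_steps=4,          # gradient accumulation and world
                                            # size this should reach the
                                            # global batch size of 128
    deepspeed="ds_zero3_offload.json",      # placeholder ZeRO-3 offload config
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=preference_dataset,       # your SPIN-generated pairs
    processing_class=tokenizer,
)
trainer.train()
```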
While training, the loss decreases very quickly, but after the first epoch the logits of both the chosen and rejected completions decrease toward 0 and the model degenerates (it generates the repeated character `` ` ``) after 1 epoch.

Here are the full logs of the training process and a sample model output; you can read more in the column "PhoGPT-4B-Chat-SPIN-0-4K-one-turn-ep1" in the attached Google Sheet:
[screenshots attached]
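For reference, the DPO objective only constrains the *difference* of the policy/reference log-ratios between the chosen and rejected completions, so the loss can keep falling even while the absolute log-probabilities of both completions collapse together, which matches the pattern described above. A minimal sketch in plain Python (the function name and scalar inputs are illustrative):

```python
import math

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss (Rafailov et al., 2023) for a single preference pair.

    Only the difference of log-ratios enters the loss, so shifting the
    log-probs of BOTH completions by the same amount leaves it unchanged.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) written stably as log(1 + exp(-x))
    return math.log1p(math.exp(-margin))

# Shifting both policy log-probs down by 5 nats leaves the loss identical:
loss_a = dpo_loss(-10.0, -20.0, -12.0, -18.0)
loss_b = dpo_loss(-15.0, -25.0, -12.0, -18.0)
```

This invariance is one reason a decreasing DPO loss alone does not rule out the kind of joint log-prob collapse you are seeing.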

Do you have any suggestions for this problem?
