Recently, I have been experimenting with DPO training for Vietnamese. I start from a strong SFT model, vinai/PhoGPT-4B-Chat, and follow the method described in Chen et al., "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models" (arXiv:2401.01335, 2024) to build a preference dataset from my own SFT data. I use trl for training with the following config (a rough sketch of the equivalent trl setup is shown after the list):
- Deepspeed zero 3 offload
- beta = 0.1
- global_batch_size 128
- learning_rate 1e-6
- learning_rate_scheduler cosine
- optim adamw_torch
- bf16
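For reference, here is a minimal sketch of how this setup maps onto trl's `DPOTrainer` (not the exact script used; the file names, the per-device/accumulation split, and some `DPOConfig` field names are assumptions and may differ between trl versions):

```python
# Rough sketch of the DPO run described above, assuming a recent trl release
# that provides DPOConfig. "spin_prefs.jsonl", "ds_zero3_offload.json" and the
# per-device batch split are placeholders, not values from the original run.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "vinai/PhoGPT-4B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# SPIN-style preference pairs built from the SFT data: each row has "prompt",
# "chosen" (the original SFT answer) and "rejected" (the current model's own
# generation for that prompt).
train_dataset = load_dataset("json", data_files="spin_prefs.jsonl", split="train")

training_args = DPOConfig(
    output_dir="phogpt-4b-chat-dpo",
    beta=0.1,
    learning_rate=1e-6,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    bf16=True,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # e.g. 2 x 8 x 8 GPUs = 128 global batch
    deepspeed="ds_zero3_offload.json",  # ZeRO-3 offload config file
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,          # reference copy is created by the trainer
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,     # `processing_class=` in newer trl releases
)
trainer.train()
```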
While training, the loss decreases very fast, but after the first epoch the logits of both the chosen and rejected responses drop to 0 and the model degenerates (it generates the repeated character `).
Here are the full logs of the training process and a sample output of the model; see the column "PhoGPT-4B-Chat-SPIN-0-4K-one-turn-ep1" in the attached Google Sheet for more details:



Do you have any suggestions for this problem?