Recently, I have been experimenting with DPO training for Vietnamese. I start from a strong SFT model, vinai/PhoGPT-4B-Chat, and follow the method described in Chen et al., "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models" (arXiv:2401.01335, 2024) to build a preference dataset from my own SFT data. I use trl for training with the following config (a rough sketch of the equivalent trl setup is shown after the list):
- Deepspeed zero 3 offload
- beta = 0.1
- global_batch_size 128
- learning_rate 1e-6
- learning_rate_scheduler cosine
- optim adamw_torch
- bf16
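For reference, here is a minimal sketch of how this setup maps onto trl's `DPOTrainer` (not the exact script used; the file names, the per-device/accumulation split, and some `DPOConfig` field names are assumptions and may differ between trl versions):

```python
# Rough sketch of the DPO run described above, assuming a recent trl release
# that provides DPOConfig. "spin_prefs.jsonl", "ds_zero3_offload.json" and the
# per-device batch split are placeholders, not values from the original run.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "vinai/PhoGPT-4B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# SPIN-style preference pairs built from the SFT data: each row has "prompt",
# "chosen" (the original SFT answer) and "rejected" (the current model's own
# generation for that prompt).
train_dataset = load_dataset("json", data_files="spin_prefs.jsonl", split="train")

training_args = DPOConfig(
    output_dir="phogpt-4b-chat-dpo",
    beta=0.1,
    learning_rate=1e-6,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    bf16=True,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # e.g. 2 x 8 x 8 GPUs = 128 global batch
    deepspeed="ds_zero3_offload.json",  # ZeRO-3 offload config file
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,          # reference copy is created by the trainer
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,     # `processing_class=` in newer trl releases
)
trainer.train()
```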
While training, the loss decreases very fast, but after the first epoch the logits of both the chosen and rejected responses drop to 0 and the model degenerates (it generates the repeated character `).
Here are the full logs of the training process and a sample output of the model; see the column "PhoGPT-4B-Chat-SPIN-0-4K-one-turn-ep1" in the attached Google Sheet for more details:



Do you have any suggestions for this problem?