feat(grpo_trainer.py): Variational Sequence-Level Soft Policy Optimization (VESPO) #5199
Open
casinca wants to merge 8 commits into huggingface:main