feat(grpo_trainer.py): Variational Sequence-Level Soft Policy Optimization (VESPO)
#15505
| Job | Run time |
|---|---|
| | 14m 16s |
| | 13m 27s |
| | 19s |
| | 22m 14s |
| | 22m 24s |
| | 21m 22s |
| | 17m 29s |
| | 21m 27s |
| | 22m 39s |
| | 20m 36s |
| Total | 2h 56m 13s |