Description
Feature request
Hi,
I'd like to propose adding VESPO (https://huggingface.co/papers/2602.10693) as a supported `loss_type` in `GRPOTrainer`.
It's quite distinct from GRPO and closer to SAPO, with a smooth trust region.
- Instead of differentiating through the IS ratio, the gradient relies directly on the current logprobs (as in vanilla policy gradient). The IS ratio is instead passed through a nonlinear function and detached, acting as a pure scaling factor.
- The advantage calculation is the same as in GRPO.
- Compatible with TIS/MIS.
The main highlight, imo, is its stable performance under heavy policy staleness and training-inference mismatch, especially with MoEs.
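For concreteness, here is a minimal PyTorch sketch of the shape of such a loss. The smooth weighting function below (a Gaussian gate on the log-ratio) is a placeholder assumption, not the paper's actual nonlinearity, and `vespo_style_loss` and `eps` are hypothetical names; the key point it illustrates is that the gradient flows only through the current logprobs, while the IS-derived weight is detached:

```python
import torch

def vespo_style_loss(logprobs, old_logprobs, advantages, mask, eps=0.5):
    """Sketch: vanilla-PG gradient through current logprobs; the IS ratio
    enters only as a detached, nonlinearly smoothed scaling factor."""
    log_ratio = logprobs - old_logprobs
    # Detached IS ratio: no gradient flows through the ratio itself.
    ratio = torch.exp(log_ratio).detach()
    # Placeholder smooth trust-region gate on the log-ratio (an assumption,
    # standing in for the paper's nonlinearity). Equal to 1 when on-policy.
    gate = torch.exp(-(log_ratio.detach() ** 2) / (2 * eps**2))
    weight = gate * ratio
    # Vanilla policy-gradient surrogate, scaled by the detached weight.
    per_token = -weight * advantages * logprobs
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```

When the policy is fresh (`logprobs == old_logprobs`), the weight reduces to 1 and this is exactly the vanilla PG surrogate with GRPO-style advantages; as staleness grows, the gate smoothly down-weights tokens rather than hard-clipping them.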
Before opening a PR, I wanted to ask whether there is interest in having this variant in TRL.
Motivation
I thought you might be interested in this one @qgallouedec, because they're getting quite nice results with sparse architectures specifically, even without routing replay; see some handpicked pics below:
Your contribution
Yes.