feat: Variational Sequence-Level Soft Policy Optimization (VESPO) #5196

@casinca

Description

Feature request

Hi,

I'd like to propose adding VESPO (https://huggingface.co/papers/2602.10693) as a supported loss_type in GRPOTrainer.

It's quite distinct from GRPO and closer to SAPO, with a smooth trust region.

  • Instead of differentiating through the IS ratio, the gradient flows directly through the current logprobs (as in vanilla policy gradient). The IS ratio is detached and passed through a nonlinear function, acting as a pure scaling factor.

  • The advantage calculation is the same as in GRPO.

  • Compatible with TIS/MIS.
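To make the bullets above concrete, here is a minimal PyTorch sketch of what such a loss could look like. This is my own illustrative reading, not the paper's exact formulation: the function name `vespo_style_loss` and the `exp(-|log r|/tau)` trust-region weight are placeholders, the paper's actual nonlinear mapping may differ.

```python
import torch

def vespo_style_loss(logprobs, old_logprobs, advantages, tau=1.0):
    """Illustrative VESPO-style sequence-level loss (hypothetical form).

    logprobs:     (B, T) token log-probs under the current policy
    old_logprobs: (B, T) token log-probs under the behavior policy
    advantages:   (B,)   GRPO-style group-normalized advantages
    """
    # Sequence-level log importance ratio, detached: the IS ratio is
    # never differentiated through, only used as a scaling factor.
    seq_log_ratio = (logprobs - old_logprobs).sum(dim=-1).detach()
    # Smooth trust region: down-weight sequences whose ratio drifts far
    # from 1. exp(-|log r| / tau) is a placeholder for the paper's
    # actual nonlinear function.
    weight = torch.exp(-seq_log_ratio.abs() / tau)
    # Vanilla policy-gradient term: the gradient flows only through the
    # current log-probs, scaled by the detached weight and advantage.
    return -(weight * advantages * logprobs.sum(dim=-1)).mean()
```

The key point is that `weight` carries no gradient, so the update direction is that of plain policy gradient, only rescaled per sequence.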

The main highlight, imo, is stable performance under heavy policy staleness and training-inference mismatch, especially with MoEs.
First, I wanted to know whether there is interest in having this variant in TRL before opening a PR.

Motivation

I thought you might be interested in this one @qgallouedec, because they report quite strong results specifically on sparse architectures, even without routing replay; see some handpicked results below:

[Image attachment: results from the paper]

Your contribution

yes
