feat: Variational Sequence-Level Soft Policy Optimization (VESPO) #5196

@casinca

Description

Feature request

Hi,

I'd like to propose adding VESPO (https://huggingface.co/papers/2602.10693) as a supported loss_type in GRPOTrainer.

It's quite distinct from GRPO and closer to SAPO, with a smooth trust region.

  • Instead of differentiating through the IS ratio, the gradient flows directly through the current logprobs (as in vanilla policy gradient). The IS ratio is detached and passed through a nonlinear function, acting as a pure scaling factor.

  • The advantage calculation is the same as in GRPO.

  • Compatible with TIS/MIS.
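To make the bullets above concrete, here is a minimal PyTorch sketch of what such a loss could look like. This is my own illustrative reading, not the paper's exact formulation: the function name `vespo_style_loss` and the `exp(-|log r|/tau)` trust-region weight are placeholders, the paper's actual nonlinear mapping may differ.

```python
import torch

def vespo_style_loss(logprobs, old_logprobs, advantages, tau=1.0):
    """Illustrative VESPO-style sequence-level loss (hypothetical form).

    logprobs:     (B, T) token log-probs under the current policy
    old_logprobs: (B, T) token log-probs under the behavior policy
    advantages:   (B,)   GRPO-style group-normalized advantages
    """
    # Sequence-level log importance ratio, detached: the IS ratio is
    # never differentiated through, only used as a scaling factor.
    seq_log_ratio = (logprobs - old_logprobs).sum(dim=-1).detach()
    # Smooth trust region: down-weight sequences whose ratio drifts far
    # from 1. exp(-|log r| / tau) is a placeholder for the paper's
    # actual nonlinear function.
    weight = torch.exp(-seq_log_ratio.abs() / tau)
    # Vanilla policy-gradient term: the gradient flows only through the
    # current log-probs, scaled by the detached weight and advantage.
    return -(weight * advantages * logprobs.sum(dim=-1)).mean()
```

The key point is that `weight` carries no gradient, so the update direction is that of plain policy gradient, only rescaled per sequence.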

The main highlight, imo, is stable performance under heavy policy staleness and training-inference mismatch, especially with MoEs.
First, I wanted to know whether there is interest in having this variant in TRL before opening a PR.

Motivation

I thought you might be interested in this one @qgallouedec, because they report quite strong results specifically on sparse architectures, even without routing replay; see some handpicked results below:

[Image attachment: results from the paper]

Your contribution

yes
