
Conversation

@ebronstein (Contributor)

The GRPO advantage computation returns the normalized scores as both the advantages and the returns, even though they only represent the advantages. This PR makes compute_grpo_outcome_advantage return the sequence-level sum of rewards as the returns, separately from the scores (i.e., the advantages).
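
For reference, a minimal sketch of the intended behavior, assuming a signature with per-token rewards, a response mask, and per-sample group indices (the names and exact signature here are illustrative, not the SkyRL implementation; groups of size 1 would need special handling):

```python
from collections import defaultdict

import torch


def compute_grpo_outcome_advantage(
    token_level_rewards: torch.Tensor,  # (batch_size, seq_len)
    response_mask: torch.Tensor,        # (batch_size, seq_len)
    index: list,                        # group id per sample (e.g. the prompt uid)
    epsilon: float = 1e-6,
):
    # Sequence-level reward: sum of per-token rewards over the response.
    scores = (token_level_rewards * response_mask).sum(dim=-1)  # (batch_size,)

    # Group the samples by prompt and normalize the scores within each group.
    groups = defaultdict(list)
    for i, idx in enumerate(index):
        groups[idx].append(i)

    advantages = torch.zeros_like(scores)
    for rows in groups.values():
        group_scores = scores[rows]
        advantages[rows] = (group_scores - group_scores.mean()) / (group_scores.std() + epsilon)

    # Advantages are the group-normalized scores; returns are the raw
    # sequence-level rewards, broadcast to token level (the change in this PR).
    advantages = advantages.unsqueeze(-1) * response_mask
    returns = scores.unsqueeze(-1) * response_mask
    return advantages, returns
```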

gemini-code-assist bot left a comment


Code Review

This pull request correctly separates the calculation of advantages and returns in compute_grpo_outcome_advantage. Previously, the function returned the computed advantages for both values. With this change, it now correctly returns the original sequence-level rewards as returns.

However, there is a critical issue with the shape of the returned returns tensor, which is currently 1D (batch_size,) instead of the expected 2D (batch_size, seq_len). I've left a specific comment with a suggested fix to address this.
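
For illustration, one way to produce the expected 2D shape is to broadcast the sequence-level rewards across the response tokens (variable names here are hypothetical, not the exact suggestion left on the diff):

```python
# scores: (batch_size,) sequence-level sum of rewards per response
# response_mask: (batch_size, seq_len) mask over the response tokens
returns = scores.unsqueeze(-1) * response_mask  # (batch_size, seq_len)
```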

Additionally, the unit tests for compute_grpo_outcome_advantage in skyrl-train/tests/cpu/utils/test_ppo_utils.py have not been updated to reflect these changes. The existing tests contain assertions that are now incorrect (e.g., asserting that advantages and returns are equal), and they would fail to catch the shape mismatch. Please update these tests to validate the new behavior.
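
A sketch of what the updated assertions might look like, assuming compute_grpo_outcome_advantage is imported from the module under test (the test name and fixture values are hypothetical):

```python
import torch


def test_grpo_outcome_advantage_separates_returns_from_advantages():
    batch_size, seq_len = 4, 8
    token_level_rewards = torch.rand(batch_size, seq_len)
    response_mask = torch.ones(batch_size, seq_len)
    index = [0, 0, 1, 1]  # two prompts, two samples each

    advantages, returns = compute_grpo_outcome_advantage(
        token_level_rewards, response_mask, index
    )

    # Both outputs should be token-level tensors.
    assert advantages.shape == (batch_size, seq_len)
    assert returns.shape == (batch_size, seq_len)

    # Returns should be the raw sequence-level rewards, not the normalized scores.
    expected_returns = (token_level_rewards * response_mask).sum(-1, keepdim=True) * response_mask
    assert torch.allclose(returns, expected_returns)
    assert not torch.allclose(advantages, returns)
```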

