
Conversation

@ebronstein (Contributor)

The GRPO advantage computation returns the normalized scores as both the advantages and the returns, even though they only represent the advantages. This PR makes compute_grpo_outcome_advantage return the sequence-level sum of rewards as the returns, separately from the scores (i.e., the advantages).
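
For reference, a minimal sketch of the intended behavior, assuming a signature with per-token rewards, a response mask, and per-sample group indices (the names and exact signature here are illustrative, not the SkyRL implementation; groups of size 1 would need special handling):

```python
from collections import defaultdict

import torch


def compute_grpo_outcome_advantage(
    token_level_rewards: torch.Tensor,  # (batch_size, seq_len)
    response_mask: torch.Tensor,        # (batch_size, seq_len)
    index: list,                        # group id per sample (e.g. the prompt uid)
    epsilon: float = 1e-6,
):
    # Sequence-level reward: sum of per-token rewards over the response.
    scores = (token_level_rewards * response_mask).sum(dim=-1)  # (batch_size,)

    # Group the samples by prompt and normalize the scores within each group.
    groups = defaultdict(list)
    for i, idx in enumerate(index):
        groups[idx].append(i)

    advantages = torch.zeros_like(scores)
    for rows in groups.values():
        group_scores = scores[rows]
        advantages[rows] = (group_scores - group_scores.mean()) / (group_scores.std() + epsilon)

    # Advantages are the group-normalized scores; returns are the raw
    # sequence-level rewards, broadcast to token level (the change in this PR).
    advantages = advantages.unsqueeze(-1) * response_mask
    returns = scores.unsqueeze(-1) * response_mask
    return advantages, returns
```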

gemini-code-assist bot left a comment


Code Review

This pull request correctly separates the calculation of advantages and returns in compute_grpo_outcome_advantage. Previously, the function returned the computed advantages for both values. With this change, it now correctly returns the original sequence-level rewards as returns.

However, there is a critical issue with the shape of the returned returns tensor, which is currently 1D (batch_size,) instead of the expected 2D (batch_size, seq_len). I've left a specific comment with a suggested fix to address this.
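
For illustration, one way to produce the expected 2D shape is to broadcast the sequence-level rewards across the response tokens (variable names here are hypothetical, not the exact suggestion left on the diff):

```python
# scores: (batch_size,) sequence-level sum of rewards per response
# response_mask: (batch_size, seq_len) mask over the response tokens
returns = scores.unsqueeze(-1) * response_mask  # (batch_size, seq_len)
```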

Additionally, the unit tests for compute_grpo_outcome_advantage in skyrl-train/tests/cpu/utils/test_ppo_utils.py have not been updated to reflect these changes. The existing tests contain assertions that are now incorrect (e.g., asserting that advantages and returns are equal), and they would fail to catch the shape mismatch. Please update these tests to validate the new behavior.
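
A sketch of what the updated assertions might look like, assuming compute_grpo_outcome_advantage is imported from the module under test (the test name and fixture values are hypothetical):

```python
import torch


def test_grpo_outcome_advantage_separates_returns_from_advantages():
    batch_size, seq_len = 4, 8
    token_level_rewards = torch.rand(batch_size, seq_len)
    response_mask = torch.ones(batch_size, seq_len)
    index = [0, 0, 1, 1]  # two prompts, two samples each

    advantages, returns = compute_grpo_outcome_advantage(
        token_level_rewards, response_mask, index
    )

    # Both outputs should be token-level tensors.
    assert advantages.shape == (batch_size, seq_len)
    assert returns.shape == (batch_size, seq_len)

    # Returns should be the raw sequence-level rewards, not the normalized scores.
    expected_returns = (token_level_rewards * response_mask).sum(-1, keepdim=True) * response_mask
    assert torch.allclose(returns, expected_returns)
    assert not torch.allclose(advantages, returns)
```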

