Add sequence mask for grpo #420

garyzhang99 · 2025-12-02T06:52:24Z

Description

Here we implement the sequence masking formula introduced in DeepSeek V3.2 paper

Since the $\delta$ hyperparameter is not given a recommended value in the original paper, we further tested the influence of $\delta$. It seems that setting the range from 0.05 to 0.2 is reasonable.

Checklist

Please check the following items before code is ready to be reviewed.

Code has passed all tests
Docstrings have been added/updated in Google Style
Documentation has been updated
Code is ready for review

gemini-code-assist · 2025-12-02T06:52:37Z

Summary of Changes

Hello @garyzhang99, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new sequence masking mechanism to the PPO policy loss calculation, drawing inspiration from the DeepSeek V3.2 paper. The primary goal is to refine the loss computation by conditionally ignoring certain tokens, potentially leading to more robust or efficient model training. The implementation includes new configuration options and diagnostic metrics to observe its effect, with further testing planned for the delta hyperparameter.

Highlights

Sequence Masking for PPO: Implemented sequence masking for the PPO policy loss function, based on the formula introduced in the DeepSeek V3.2 paper. This feature aims to improve training stability or performance by selectively masking tokens during loss calculation.
Configurable Parameters: Added enable_sequence_masking (boolean) and delta (float) parameters to the PPOPolicyLoss constructor. The delta hyperparameter controls the threshold for sequence-level KL divergence, influencing when masking is applied.
New Metrics: Introduced new metrics, seq_mask/masked_tokens and seq_mask/mean_sequence_kl, to monitor the effect of sequence masking during training. These metrics provide insights into how many tokens are being masked and the average sequence KL divergence.
Dedicated Test Case: Added a new test test_ppo_policy_loss_with_sequence_masking to verify the correct functionality of the sequence masking logic, ensuring that metrics are present and that the loss differs when masking occurs.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request implements sequence masking for the PPO policy loss, based on the formula from the DeepSeek V3.2 paper. The implementation is well-structured and includes a corresponding test case to validate the new functionality. My review focuses on improving code consistency and ensuring the correctness of the calculated metrics. I've suggested using an existing utility function for consistency and have pointed out a small correction needed in the masked_tokens metric calculation to ensure it only considers valid tokens.

trinity/algorithm/policy_loss_fn/ppo_policy_loss.py

garyzhang99 · 2025-12-02T09:02:23Z

/unittest-module-algorithm

github-actions · 2025-12-02T09:04:32Z

Summary

Tests 📝	Passed ✅	Failed ❌	Skipped ⏭️	Other ❓	Flaky 🍂	Duration ⏱️
15	15	0	0	0	0	18.3s

Tests

Test Name	Status	Duration
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_batch_level_std_grpo	✅	41ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_batch_level_step_wise_grpo_advantage	✅	2ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_duplicate_grpo	✅	5ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_advantage	✅	3ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_correct_bias	✅	2ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_reward_std	✅	1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_step_wise_grpo_advantage	✅	2ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_step_wise_grpo_with_std_threshold	✅	2ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_dpo_policy_loss	✅	2ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_gspo_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_mix_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_opmd_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss_with_sequence_masking	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_sft_policy_loss	✅	1ms

Github Test Reporter by CTRF 💚

trinity/algorithm/policy_loss_fn/ppo_policy_loss.py

garyzhang99 · 2025-12-05T03:46:29Z

/unittest-module-algorithm

github-actions · 2025-12-05T03:48:45Z

Summary

Tests 📝	Passed ✅	Failed ❌	Skipped ⏭️	Other ❓	Flaky 🍂	Duration ⏱️
27	27	0	0	0	0	18.3s

Tests

Test Name	Status	Duration
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_batch_level_std_grpo	✅	40ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_batch_level_step_wise_grpo_advantage	✅	2ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_duplicate_grpo	✅	5ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_advantage	✅	3ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_correct_bias	✅	2ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_reward_std	✅	1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_step_wise_grpo_advantage	✅	2ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_step_wise_grpo_with_std_threshold	✅	2ms
tests/algorithm/kl_fn_test.py::KLFnTest::test_abs_kl_fn	✅	1ms
tests/algorithm/kl_fn_test.py::KLFnTest::test_corrected_k3_fallback	✅	1ms
tests/algorithm/kl_fn_test.py::KLFnTest::test_corrected_k3_loss	✅	1ms
tests/algorithm/kl_fn_test.py::KLFnTest::test_corrected_k3_same_policy	✅	1ms
tests/algorithm/kl_fn_test.py::KLFnTest::test_corrected_k3_with_old_logprob	✅	1ms
tests/algorithm/kl_fn_test.py::KLFnTest::test_dummy_kl_fn	✅	1ms
tests/algorithm/kl_fn_test.py::KLFnTest::test_k1_kl_fn	✅	1ms
tests/algorithm/kl_fn_test.py::KLFnTest::test_k2_kl_fn	✅	1ms
tests/algorithm/kl_fn_test.py::KLFnTest::test_k3_kl_fn	✅	1ms
tests/algorithm/kl_fn_test.py::KLFnTest::test_kl_loss_aggregation_modes	✅	1ms
tests/algorithm/kl_fn_test.py::KLFnTest::test_low_var_kl_fn	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_dpo_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_gspo_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_mix_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_opmd_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss_with_sequence_masking	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_sapo_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_sft_policy_loss	✅	1ms

Github Test Reporter by CTRF 💚

add sequence mask for grpo

221b075

gemini-code-assist bot reviewed Dec 2, 2025

View reviewed changes

trinity/algorithm/policy_loss_fn/ppo_policy_loss.py Show resolved Hide resolved

trinity/algorithm/policy_loss_fn/ppo_policy_loss.py Show resolved Hide resolved

udpate policy loss value test

6400b5d

garyzhang99 changed the title ~~[WIP]Add sequence mask for grpo~~ Add sequence mask for grpo Dec 3, 2025

garyzhang99 and others added 3 commits December 3, 2025 19:37

Merge branch 'main' into dev/scalable_grpo

96a0820

fix pre commit after merge main

a2ce985

not redundent compute

e221c8a

yanxi-chen reviewed Dec 4, 2025

View reviewed changes

clean code as suggested

7763bdd

pan-x-c approved these changes Dec 5, 2025

View reviewed changes

pan-x-c merged commit 613194d into agentscope-ai:main Dec 5, 2025
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add sequence mask for grpo #420

Add sequence mask for grpo #420

Uh oh!

garyzhang99 commented Dec 2, 2025 •

edited

Loading

Uh oh!

gemini-code-assist bot commented Dec 2, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

garyzhang99 commented Dec 2, 2025

Uh oh!

github-actions bot commented Dec 2, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

garyzhang99 commented Dec 5, 2025

Uh oh!

github-actions bot commented Dec 5, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add sequence mask for grpo #420

Add sequence mask for grpo #420

Uh oh!

Conversation

garyzhang99 commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

gemini-code-assist bot commented Dec 2, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

garyzhang99 commented Dec 2, 2025

Uh oh!

github-actions bot commented Dec 2, 2025

Summary

Tests

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

garyzhang99 commented Dec 5, 2025

Uh oh!

github-actions bot commented Dec 5, 2025

Summary

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

garyzhang99 commented Dec 2, 2025 •

edited

Loading