I am studying how to use TRL to apply GRPO to LLMs. While reading through the docs, I noticed that there is no micro-batch argument in GRPOConfig, which is usually used to control the degree of on-policy vs. off-policy training in PPO-like algorithms (GRPO, DAPO, ...) through importance sampling. For reference, verl has ppo_micro_batch_size_per_gpu for this purpose. So my question is: how can I control the on-policy/off-policy degree of GRPO?
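To make the question concrete, here is a minimal sketch (illustrative only, not TRL internals) of the clipped importance-sampling ratio that PPO-style objectives rely on when a generation batch is reused for multiple updates. The function name and values are made up for illustration.

```python
import math

def clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single token/action.

    ratio = pi_new(a|s) / pi_old(a|s); it equals 1.0 on the first
    (fully on-policy) update and drifts away from 1.0 as the same
    batch of generations is reused for further (off-policy) updates.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# First update on a fresh batch: ratio is exactly 1, no clipping.
print(clipped_objective(-1.0, -1.0, 2.0))

# Later update on the same batch: the policy has moved, so the
# ratio exceeds 1 + eps and the objective is clipped at 1.2 * A.
print(clipped_objective(-0.5, -1.0, 2.0))
```

The more updates you run on one batch of generations, the further the ratio drifts from 1, which is exactly the off-policy degree the question is about.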
Answered by FuRuF-11 on Dec 18, 2025:
After reviewing the docs again, I found that num_iterations is the parameter I need: it sets the number of optimization steps performed per batch of generations, so num_iterations=1 is fully on-policy and larger values reuse each batch for several off-policy updates.
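A minimal configuration sketch (a config fragment; output directory and batch size here are placeholder values, and I have not verified this exact setup end to end):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-output",        # placeholder path
    per_device_train_batch_size=8,   # placeholder value
    # num_iterations > 1 reuses each batch of generations for
    # multiple updates, moving training off-policy; set it to 1
    # for fully on-policy GRPO.
    num_iterations=4,
)
```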