I am studying how to use TRL to apply GRPO to LLMs. While reading through the docs, I noticed that there is no micro-batch argument in GRPOConfig, which is usually used to control the degree of on-policy vs. off-policy training in PPO-like algorithms (GRPO, DAPO, ...) through importance sampling. For reference, verl has ppo_micro_batch_size_per_gpu for this purpose. So my question is: how can I control the on-policy/off-policy degree of GRPO?
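To make the question concrete, here is a minimal sketch (illustrative only, not TRL internals) of the clipped importance-sampling ratio that PPO-style objectives rely on when a generation batch is reused for multiple updates. The function name and values are made up for illustration.

```python
import math

def clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single token/action.

    ratio = pi_new(a|s) / pi_old(a|s); it equals 1.0 on the first
    (fully on-policy) update and drifts away from 1.0 as the same
    batch of generations is reused for further (off-policy) updates.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# First update on a fresh batch: ratio is exactly 1, no clipping.
print(clipped_objective(-1.0, -1.0, 2.0))

# Later update on the same batch: the policy has moved, so the
# ratio exceeds 1 + eps and the objective is clipped at 1.2 * A.
print(clipped_objective(-0.5, -1.0, 2.0))
```

The more updates you run on one batch of generations, the further the ratio drifts from 1, which is exactly the off-policy degree the question is about.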
Answered by FuRuF-11 on Dec 18, 2025:
After reviewing the docs again, I found that num_iterations is the parameter I need: it sets the number of optimization steps performed per batch of generations, so num_iterations=1 is fully on-policy and larger values reuse each batch for several off-policy updates.
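A minimal configuration sketch (a config fragment; output directory and batch size here are placeholder values, and I have not verified this exact setup end to end):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-output",        # placeholder path
    per_device_train_batch_size=8,   # placeholder value
    # num_iterations > 1 reuses each batch of generations for
    # multiple updates, moving training off-policy; set it to 1
    # for fully on-policy GRPO.
    num_iterations=4,
)
```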