Description
Disclaimer: I am not the author of the paper, but I believe this technique offers a high-value, low-cost improvement for the library.
What is F-GRPO and why use it
The F-GRPO paper proposes down-weighting the gradient signal from prompts the model already solves reliably during RL, preventing the optimizer from "over-fitting" to easy solutions at the expense of exploration. The idea is inspired by focal loss in image classification.
The motivation is that for a practical group size (e.g., 16):
- The gradient signal can be dominated by the common correct trajectories, so down-weighting that signal may reduce the "sharpening" effect RL has on the output distribution (likely good for exploration).
- This gradient domination can also drown out the gradient signal from correct trajectories on hard prompts (which are rare at small group sizes), so we want to protect that signal from the dominating gradient of easy tasks.
The key intuition is gradient competition: accepting gradient signal from some samples reduces the probability assigned to other samples. A rough sketch of the reweighting is shown below.
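To make the mechanism concrete, here is a minimal PyTorch sketch of what a focal-style down-weighting of GRPO advantages might look like. This is my illustration, not the paper's exact formulation and not an existing slime API: the binary-reward assumption, the `(1 - p)^gamma` weight, and all function names are hypothetical.

```python
import torch


def focal_grpo_weights(rewards: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Per-prompt focal-style weights (sketch only, not the paper's exact form).

    rewards: [num_prompts, group_size] binary correctness (1 = correct, 0 = wrong).
    Returns one weight per prompt that shrinks toward 0 as the group success
    rate approaches 1, so already-solved prompts contribute less gradient.
    """
    success_rate = rewards.float().mean(dim=-1)   # p_i in [0, 1]
    return (1.0 - success_rate).pow(gamma)        # (1 - p_i)^gamma, focal-loss style


def reweighted_advantages(rewards: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Standard GRPO group-normalized advantages, scaled by the focal weight."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True).clamp_min(1e-6)
    advantages = (rewards - mean) / std                       # usual GRPO normalization
    weights = focal_grpo_weights(rewards, gamma).unsqueeze(-1)
    return advantages * weights                               # easy prompts down-weighted


# Example: a prompt solved 15/16 times gets far less weight than one solved 4/16 times.
rewards = torch.tensor([[1.0] * 15 + [0.0], [1.0] * 4 + [0.0] * 12])
print(focal_grpo_weights(rewards))  # ~tensor([0.0039, 0.5625]) with gamma=2
```

The only point of the example is the relative scaling: nearly-solved prompts contribute almost no gradient while hard prompts keep theirs. The actual weighting scheme should of course follow the paper if this gets implemented.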
Experiment results in the paper
- Mainly on small models. Better Pass@K at large K (indicating less of a sharpening effect).
Discussion needed
I want to run more experiments to verify whether this is a technique worth adopting:
- On larger models like Qwen30B-A3B
- On what training recipe? Use a simpler but verified recipe such as JustRL?
- On more tasks/data? If so, what tasks/data?
However, I am a new contributor to this repo, so I would like to ask: to verify whether this feature should be introduced into Slime, what experiments should be run? Any other comments?
My Contributions
- Run more experiments.
- Implement the feature if the maintainers agree on integrating F-GRPO.
Thanks!