Is it possible to apply GRPOTrainer directly to handle interactive environments? #352
Some-random
started this conversation in
General
Replies: 0 comments
For example, in agentic environments some parts of the trajectory, such as environment responses, shouldn't participate in the gradient calculation. I'm not sure whether this is currently supported: as far as I can tell, the trainer assumes that every token in a sequence was generated by the model and should contribute to the gradient computation.
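To make the question concrete, here is a minimal sketch in plain Python of the kind of masking meant above. It is not GRPOTrainer's API, and I don't know whether the trainer exposes anything like it: `build_loss_mask`, `masked_mean`, the `"model"`/`"env"` labels, and the token ids and loss values are all hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical trajectory format: a list of (source, token_ids) segments,
# where source is "model" for generated tokens and "env" for environment
# responses that should not contribute to the gradient.
Segment = Tuple[str, List[int]]


def build_loss_mask(segments: List[Segment]) -> Tuple[List[int], List[int]]:
    """Flatten the segments and mark model-generated tokens with 1, others with 0."""
    token_ids: List[int] = []
    mask: List[int] = []
    for source, ids in segments:
        token_ids.extend(ids)
        mask.extend([1 if source == "model" else 0] * len(ids))
    return token_ids, mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the per-token loss over unmasked (model-generated) positions only."""
    total = sum(loss * m for loss, m in zip(per_token_loss, mask))
    count = sum(mask)
    return total / count if count else 0.0


# Illustrative multi-turn trajectory: model turn, environment response, model turn.
trajectory = [("model", [11, 12, 13]), ("env", [90, 91]), ("model", [14, 15])]
token_ids, loss_mask = build_loss_mask(trajectory)

# Made-up per-token losses; the two environment positions (9.0) are ignored.
per_token_loss = [0.5, 1.0, 1.5, 9.0, 9.0, 2.0, 3.0]
loss = masked_mean(per_token_loss, loss_mask)
print(loss_mask, loss)
```

In a real trainer the same mask would presumably be multiplied into the per-token policy loss before reduction, so that environment tokens stay in the context but receive no gradient.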