Training Agents with GRPO #2704
Replies: 3 comments 4 replies
-
@August-murr thank you for your points!
|
-
@aymeric-roucher When using agents, what's the best way to structure the conversation, including tool use and code execution, so it matches how smolagents expects input at inference? Maybe create the dataset out of:
final_answer = agent.run(query)
agent.memory.get_full_steps()
When using the agent, inference can happen in multiple steps. If we combine all steps into one conversation using …
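A minimal sketch of what that could look like, assuming `agent.memory.get_full_steps()` returns serializable per-step dicts (tool calls, code, observations) as suggested above; exact field names may differ between smolagents versions:

```python
# Sketch: build one training record per query from a smolagents run.
# Assumes agent.memory.get_full_steps() returns serializable step dicts;
# the exact fields may differ across smolagents versions.
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel())

def build_record(query: str) -> dict:
    final_answer = agent.run(query)        # runs the full multi-step loop
    steps = agent.memory.get_full_steps()  # tool calls, code, observations
    return {
        "query": query,
        "steps": steps,  # keep per-step structure so it can be flattened later
        "final_answer": str(final_answer),
    }

records = [build_record(q) for q in ["What is 2 + 2?"]]
```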
-
Opened a PR that should allow for fairly generic rollout protocols. Users can just pass an Environment object which wraps the vLLM generate step (returning …)
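For reference (the PR itself isn't quoted here), a hypothetical sketch of what such an Environment interface could look like; the names below are illustrative, not the actual API from the PR:

```python
# Hypothetical Environment protocol: the trainer hands over a batch of
# prompts plus the vLLM engine, and the environment runs its own rollout
# logic (single-turn, tool use, multi-step agent loops) before returning
# completions. Illustrative only; not the PR's actual interface.
from typing import Protocol
from vllm import LLM, SamplingParams

class Environment(Protocol):
    def generate(self, prompts: list[str], llm: LLM,
                 sampling_params: SamplingParams) -> list[str]:
        ...

class SingleTurnEnv:
    """Default behaviour: plain one-shot generation, no tools."""
    def generate(self, prompts, llm, sampling_params):
        outputs = llm.generate(prompts, sampling_params)
        return [out.outputs[0].text for out in outputs]
```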
-
I don't think there's a need to explain the motives; it seems pretty clear that agents and RL are going to be the force pushing the industry forward.
So let's discuss how we can combine smolagents and TRL to train an agent with GRPO.
@aymeric-roucher has been working on creating an R1 agent, but an alternative would be to fine-tune a smaller distilled R1 agent with GRPO.
In its current state, smolagents has some issues that need to be resolved first:
There's no clear way to set sampling parameters like temperature; at least it isn't in the docs.
Agents need to generate multiple responses in parallel, which also isn't possible as of now, or isn't clarified in the docs (a workaround sketch follows this list).
It would also be ideal to have smolagents work with vLLM.
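As a stopgap until this is clarified, one workaround is to run G independent agent instances in a thread pool to get a GRPO-style group of rollouts per prompt. The temperature kwarg below is an assumption about the model class forwarding it to generation, which is exactly the part that needs confirming:

```python
# Workaround sketch: G independent rollouts per prompt via a thread pool.
# A fresh agent per rollout keeps memories from interleaving. Passing
# temperature to the model is an assumption (the very point that needs
# clarifying in the smolagents docs).
from concurrent.futures import ThreadPoolExecutor
from smolagents import CodeAgent, HfApiModel

NUM_GENERATIONS = 4  # GRPO group size

def one_rollout(query: str) -> str:
    model = HfApiModel(temperature=0.7)  # assumed kwarg, see note above
    agent = CodeAgent(tools=[], model=model)
    return str(agent.run(query))

def group_rollouts(query: str) -> list[str]:
    with ThreadPoolExecutor(max_workers=NUM_GENERATIONS) as pool:
        return list(pool.map(one_rollout, [query] * NUM_GENERATIONS))
```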
I'd like to ask @aymeric-roucher to comment and clarify.
The rest would be modifying `compute_loss` in `GRPOTrainer` and wrapping the model as a smolagents agent, along with a reward function that rewards matching the ground truth, among other things.
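The reward side could follow TRL's custom reward-function interface (a callable that takes completions plus dataset columns as kwargs and returns one float per completion). A rough sketch; the ground_truth column name, the containment check, and the model id are illustrative choices rather than anything fixed by TRL, and the agent-wrapping part remains the open question above:

```python
# Sketch of a ground-truth reward plugged into GRPOTrainer. The dataset is
# assumed to have "prompt" and "ground_truth" columns; the match criterion
# is a placeholder for whatever verification the task allows.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

train_dataset = Dataset.from_list([
    {"prompt": "What is 2 + 2?", "ground_truth": "4"},
])

def correctness_reward(completions, ground_truth, **kwargs):
    rewards = []
    for completion, answer in zip(completions, ground_truth):
        # Reward 1.0 when the reference answer appears in the completion.
        rewards.append(1.0 if str(answer).strip() in str(completion) else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="grpo-agent", num_generations=4),
    train_dataset=train_dataset,
)
```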
Am I missing something? Is there more to it?
What do you all think?