Training Agents with GRPO #2704
Replies: 3 comments 4 replies
-
@August-murr thank you for your points!
|
-
@aymeric-roucher When using agents, what's the best way to structure the conversation, including tool use and code execution, so it matches how smolagents expects input at inference? Maybe create the dataset out of:
final_answer = agent.run(query)
agent.memory.get_full_steps()
When using the agent, inference can happen in multiple steps. If we combine all steps into one conversation using …
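A minimal sketch of what that could look like, assuming `agent.memory.get_full_steps()` returns serializable per-step dicts (tool calls, code, observations) as suggested above; exact field names may differ between smolagents versions:

```python
# Sketch: build one training record per query from a smolagents run.
# Assumes agent.memory.get_full_steps() returns serializable step dicts;
# the exact fields may differ across smolagents versions.
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel())

def build_record(query: str) -> dict:
    final_answer = agent.run(query)        # runs the full multi-step loop
    steps = agent.memory.get_full_steps()  # tool calls, code, observations
    return {
        "query": query,
        "steps": steps,  # keep per-step structure so it can be flattened later
        "final_answer": str(final_answer),
    }

records = [build_record(q) for q in ["What is 2 + 2?"]]
```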
-
Opened a PR that should allow for fairly generic rollout protocols. Users can just pass an Environment object which wraps the vLLM generate step (returning …)
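For reference (the PR itself isn't quoted here), a hypothetical sketch of what such an Environment interface could look like; the names below are illustrative, not the actual API from the PR:

```python
# Hypothetical Environment protocol: the trainer hands over a batch of
# prompts plus the vLLM engine, and the environment runs its own rollout
# logic (single-turn, tool use, multi-step agent loops) before returning
# completions. Illustrative only; not the PR's actual interface.
from typing import Protocol
from vllm import LLM, SamplingParams

class Environment(Protocol):
    def generate(self, prompts: list[str], llm: LLM,
                 sampling_params: SamplingParams) -> list[str]:
        ...

class SingleTurnEnv:
    """Default behaviour: plain one-shot generation, no tools."""
    def generate(self, prompts, llm, sampling_params):
        outputs = llm.generate(prompts, sampling_params)
        return [out.outputs[0].text for out in outputs]
```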
-
I don't think there's a need to explain the motives; it seems pretty clear that agents and RL are going to be the force pushing the industry forward.
So let's discuss how we can combine smolagents and TRL to train an agent with GRPO.
@aymeric-roucher has been working on creating an R1 agent, but an alternative would be to fine-tune a smaller distilled R1 agent with GRPO.
In its current state, smolagents has some issues that need to be resolved first:
There's no clear way to set sampling parameters like temperature; at least it isn't in the docs.
Agents need to generate multiple responses in parallel, which also isn't possible as of now, or isn't clarified in the docs (a workaround sketch follows this list).
It would also be ideal to have smolagents work with vLLM.
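As a stopgap until this is clarified, one workaround is to run G independent agent instances in a thread pool to get a GRPO-style group of rollouts per prompt. The temperature kwarg below is an assumption about the model class forwarding it to generation, which is exactly the part that needs confirming:

```python
# Workaround sketch: G independent rollouts per prompt via a thread pool.
# A fresh agent per rollout keeps memories from interleaving. Passing
# temperature to the model is an assumption (the very point that needs
# clarifying in the smolagents docs).
from concurrent.futures import ThreadPoolExecutor
from smolagents import CodeAgent, HfApiModel

NUM_GENERATIONS = 4  # GRPO group size

def one_rollout(query: str) -> str:
    model = HfApiModel(temperature=0.7)  # assumed kwarg, see note above
    agent = CodeAgent(tools=[], model=model)
    return str(agent.run(query))

def group_rollouts(query: str) -> list[str]:
    with ThreadPoolExecutor(max_workers=NUM_GENERATIONS) as pool:
        return list(pool.map(one_rollout, [query] * NUM_GENERATIONS))
```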
I'd like to ask @aymeric-roucher to comment and clarify.
The rest would be modifying `compute_loss` in `GRPOTrainer` and wrapping the model as a smolagents agent, along with a reward function that rewards matching the ground truth, among other things.
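The reward side could follow TRL's custom reward-function interface (a callable that takes completions plus dataset columns as kwargs and returns one float per completion). A rough sketch; the ground_truth column name, the containment check, and the model id are illustrative choices rather than anything fixed by TRL, and the agent-wrapping part remains the open question above:

```python
# Sketch of a ground-truth reward plugged into GRPOTrainer. The dataset is
# assumed to have "prompt" and "ground_truth" columns; the match criterion
# is a placeholder for whatever verification the task allows.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

train_dataset = Dataset.from_list([
    {"prompt": "What is 2 + 2?", "ground_truth": "4"},
])

def correctness_reward(completions, ground_truth, **kwargs):
    rewards = []
    for completion, answer in zip(completions, ground_truth):
        # Reward 1.0 when the reference answer appears in the completion.
        rewards.append(1.0 if str(answer).strip() in str(completion) else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="grpo-agent", num_generations=4),
    train_dataset=train_dataset,
)
```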
Am I missing something? Is there more to it?
What do you all think?