WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
Comparison between existing methods and our WebAgent-R1 on the WebArena-Lite benchmark. The proposed method outperforms both strong prompting-based and finetuned baselines, achieving superior performance across various model sizes.
- Top: Overview of the end-to-end multi-turn RL training framework used in WebAgent-R1.
- Bottom: An input/output example of agent–web interaction at the k-th step. The interaction continues until either the maximum number of steps is reached or the agent generates an `exit()` action to signal task completion.
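To make the termination logic concrete, below is a minimal sketch of this interaction loop, assuming hypothetical `agent`/`env` interfaces (`env.reset`, `env.step`, `agent.act`) and an illustrative step budget; it is not the paper's actual implementation.

```python
# A minimal sketch of the multi-turn agent-web interaction loop.
# The agent/env interfaces here are illustrative placeholders.

def run_episode(agent, env, task, max_steps=15):
    """Roll out one agent-web episode for a given task."""
    observation = env.reset(task)               # initial page state (e.g., HTML)
    trajectory = []
    for _ in range(max_steps):                  # stop at the interaction budget
        action = agent.act(task, observation, trajectory)  # k-th step output
        trajectory.append((observation, action))
        if action.strip() == "exit()":          # agent signals task completion
            break
        observation = env.step(action)          # execute the action on the page
    return trajectory
```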
Training dynamics during RL, including rewards, trajectory length, and number of interactions. As indicated by the dashed vertical lines in the figure, the entire process can be broadly divided into three phases: (1) initial skill acquisition, (2) exploration for policy refinement, and (3) final policy stabilization.
We compare WebAgent-R1 (R1) with two variants:
- WebAgent-R1-Zero (R1-Zero), initialized from an off-the-shelf model without SFT;
- WebAgent-R1-CoT (R1-CoT), initialized from an SFT model trained with long chain-of-thought (CoT) data during behavior cloning.
We report task success rate, single-turn response length, and number of interactions, evaluated both before and after applying RL.
Analysis of prompting design. We report the average success rate (SR), single-turn response length, and number of interactions. The results reveal a novel test-time scaling paradigm for multi-turn interactive web tasks: increasing the number of allowed interactions.
Analysis of test-time scaling with an increased maximum number of interactions. Allowing more interactions enables the web agent to produce longer trajectories and consistently improves the success rate.
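As a hedged illustration of this scaling knob, the sketch below sweeps the interaction budget of the loop shown earlier and measures success rate; `is_success`, `tasks`, and the budget values are hypothetical placeholders, not the benchmark's API.

```python
# A sketch of the test-time scaling sweep, reusing `run_episode` from
# the loop above. `is_success` and `tasks` stand in for the benchmark's
# task-completion check and task set.

def success_rate(agent, env, tasks, max_steps):
    wins = 0
    for task in tasks:
        trajectory = run_episode(agent, env, task, max_steps=max_steps)
        wins += int(is_success(env, task, trajectory))   # 1 if task solved
    return wins / len(tasks)

# Larger budgets permit longer trajectories; the paper reports that the
# success rate improves consistently as the budget grows.
for budget in (5, 10, 15, 20):                           # hypothetical budgets
    sr = success_rate(agent, env, tasks, max_steps=budget)
    print(f"max interactions = {budget}: SR = {sr:.2%}")
```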
Comparison of model outputs from WebAgent-R1 and WebAgent-R1-CoT. We present successful trajectories from both models on the same task ("What are the top-3 best-selling products in Jan 2023?"), showing only the first two steps for clarity (the entire trajectory is shown in the subsequent web screenshots for full context).
Compared to WebAgent-R1, the long-CoT variant WebAgent-R1-CoT exhibits a more detailed thinking process.
A real-world example of a successful trajectory generated by WebAgent-R1 on the task: "What are the top-3 best-selling products in Jan 2023?"