WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
Comparison between existing methods and our WebAgent-R1 on the WebArena-Lite benchmark. The proposed method outperforms both strong prompting-based and finetuned baselines, achieving superior performance across various model sizes.
- Top: Overview of the end-to-end multi-turn RL training framework used in WebAgent-R1.
- Bottom: An input/output example of agent–web interaction at the k-th step. The interaction continues until either the maximum number of steps is reached or the agent generates an `exit()` action to signal task completion.
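To make the termination logic concrete, below is a minimal sketch of this interaction loop, assuming hypothetical `agent`/`env` interfaces (`env.reset`, `env.step`, `agent.act`) and an illustrative step budget; it is not the paper's actual implementation.

```python
# A minimal sketch of the multi-turn agent-web interaction loop.
# The agent/env interfaces here are illustrative placeholders.

def run_episode(agent, env, task, max_steps=15):
    """Roll out one agent-web episode for a given task."""
    observation = env.reset(task)               # initial page state (e.g., HTML)
    trajectory = []
    for _ in range(max_steps):                  # stop at the interaction budget
        action = agent.act(task, observation, trajectory)  # k-th step output
        trajectory.append((observation, action))
        if action.strip() == "exit()":          # agent signals task completion
            break
        observation = env.step(action)          # execute the action on the page
    return trajectory
```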
Training dynamics during RL, including rewards, trajectory length, and number of interactions. As indicated by the dashed vertical lines in the figure, the entire process can be broadly divided into three phases: (1) initial skill acquisition, (2) exploration for policy refinement, and (3) final policy stabilization.
We compare WebAgent-R1 (R1) with two variants:
- WebAgent-R1-Zero (R1-Zero), initialized from an off-the-shelf model without SFT;
- WebAgent-R1-CoT (R1-CoT), initialized from an SFT model trained with long chain-of-thought (CoT) data during behavior cloning.
We report task success rate, single-turn response length, and number of interactions, evaluated both before and after applying RL.
Analysis of prompting design. We report the average success rate (SR), single-turn response length, and number of interactions. The results reveal a novel test-time scaling paradigm for multi-turn interactive web tasks: increasing the number of allowed interactions.
Analysis of test-time scaling with an increased maximum number of interactions. Allowing more interactions enables the web agent to produce longer trajectories and consistently improves the success rate.
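As a hedged illustration of this scaling knob, the sketch below sweeps the interaction budget of the loop shown earlier and measures success rate; `is_success`, `tasks`, and the budget values are hypothetical placeholders, not the benchmark's API.

```python
# A sketch of the test-time scaling sweep, reusing `run_episode` from
# the loop above. `is_success` and `tasks` stand in for the benchmark's
# task-completion check and task set.

def success_rate(agent, env, tasks, max_steps):
    wins = 0
    for task in tasks:
        trajectory = run_episode(agent, env, task, max_steps=max_steps)
        wins += int(is_success(env, task, trajectory))   # 1 if task solved
    return wins / len(tasks)

# Larger budgets permit longer trajectories; the paper reports that the
# success rate improves consistently as the budget grows.
for budget in (5, 10, 15, 20):                           # hypothetical budgets
    sr = success_rate(agent, env, tasks, max_steps=budget)
    print(f"max interactions = {budget}: SR = {sr:.2%}")
```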
Comparison of model outputs from WebAgent-R1 and WebAgent-R1-CoT. We present successful trajectories from both models on the same task ("What are the top-3 best-selling products in Jan 2023?"), showing only the first two steps for clarity (the entire trajectory is shown in the subsequent web screenshots for full context).
Compared to WebAgent-R1, the long-CoT variant WebAgent-R1-CoT exhibits a more detailed thinking process.
A real-world example of a successful trajectory generated by WebAgent-R1 on the task: "What are the top-3 best-selling products in Jan 2023?"