
WebAgent-R1

WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

WebArena-Lite Benchmark

Comparison between existing methods and our WebAgent-R1 on the WebArena-Lite benchmark. The proposed method outperforms both strong prompting-based and finetuned baselines, achieving superior performance across various model sizes.


WebAgent-R1

  • Top: Overview of the end-to-end multi-turn RL training framework used in WebAgent-R1.
  • Bottom: An input/output example of agent–web interaction at the k-th step. The interaction continues until either the maximum number of steps is reached or the agent generates an exit() action to signal task completion.
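The loop below is a minimal sketch of this interaction protocol. All helper names (render_observation, policy.generate, env.step, compute_reward) and the step cap are hypothetical stand-ins for illustration, not the repository's actual API.

```python
# Minimal sketch of the multi-turn interaction loop; every helper used here
# (render_observation, policy.generate, env.step, compute_reward) is a
# hypothetical stand-in, not the repository's actual API.
def rollout(policy, env, task, max_steps=15):
    """Interact with the web environment until exit() or the step limit."""
    trajectory = []
    obs = env.reset(task)                                   # initial page observation
    for _ in range(max_steps):
        prompt = render_observation(task, trajectory, obs)  # build the k-th step input
        action = policy.generate(prompt)                    # agent's k-th step output (an action string)
        trajectory.append((prompt, action))
        if action.strip().startswith("exit("):              # agent signals task completion
            break
        obs = env.step(action)                              # execute the action, observe the next page
    reward = compute_reward(env, task)                      # outcome reward used for end-to-end RL
    return trajectory, reward
```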


Training Dynamics

Training dynamics during RL, including rewards, trajectory length, and number of interactions. As indicated by the dashed vertical lines in the figure, the entire process can be broadly divided into three phases: (1) initial skill acquisition, (2) exploration for policy refinement, and (3) final policy stabilization.


Ablation study on RL initialization policy

We compared WebAgent-R1 (R1) with two variants:

  • WebAgent-R1-Zero (R1-Zero), initialized from an off-the-shelf model without SFT;
  • WebAgent-R1-CoT (R1-CoT), initialized from an SFT model trained with long chain-of-thought (CoT) data during behavior cloning.

We report task success rate, single-turn response length, and number of interactions, evaluated both before and after applying RL.
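The three initialization settings can be summarized by the illustrative configuration below; the keys and checkpoint names are hypothetical, not the repository's actual config.

```python
# Illustrative (hypothetical) summary of the RL-initialization ablation settings.
INIT_POLICIES = {
    "WebAgent-R1-Zero": {"sft_checkpoint": None},            # RL from an off-the-shelf model, no SFT
    "WebAgent-R1":      {"sft_checkpoint": "sft-standard"},   # behavior cloning on standard demonstrations
    "WebAgent-R1-CoT":  {"sft_checkpoint": "sft-long-cot"},   # behavior cloning on long chain-of-thought data
}
```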


Analysis

Prompting w/ Thinking vs. Non-Thinking Format

Analysis of prompting design. We report the average success rate (SR), single-turn response length, and number of interactions. The results reveal a novel test-time scaling paradigm for multi-turn interactive web tasks: success rates improve as the number of interactions increases.
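The snippet below sketches one plausible way the two prompt formats could differ; the tags and wording are assumptions, not the exact templates used in the paper.

```python
# Hypothetical prompt templates for the thinking vs. non-thinking formats.
THINKING_TEMPLATE = (
    "{observation}\n"
    "First reason about the page inside <think>...</think>, "
    "then output exactly one action inside <answer>...</answer>."
)
NON_THINKING_TEMPLATE = (
    "{observation}\n"
    "Output exactly one action for the current step."
)

def build_prompt(observation: str, with_thinking: bool) -> str:
    """Render the current step's input in the chosen format."""
    template = THINKING_TEMPLATE if with_thinking else NON_THINKING_TEMPLATE
    return template.format(observation=observation)
```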


Test-time scaling through increased interactions

Analysis of test-time scaling with increased max number of interactions. Allowing more interactions enables the web agent to produce longer trajectories and consistently improves the success rate.
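A minimal sketch of this evaluation is shown below, reusing the hypothetical rollout helper from the earlier sketch; the evaluate signature and budget values are illustrative, not taken from the repository.

```python
# Illustrative test-time scaling sweep: same policy, increasing interaction budgets.
def evaluate(policy, tasks, max_interactions):
    """Success rate when each rollout may use up to `max_interactions` steps."""
    successes = 0
    for task in tasks:
        _, reward = rollout(policy, task.env, task, max_steps=max_interactions)
        successes += int(reward > 0)    # outcome reward: 1 if the task succeeded
    return successes / len(tasks)

# Example sweep over interaction budgets (policy and tasks assumed to be defined):
# for budget in (5, 10, 15, 20):
#     print(budget, evaluate(policy, tasks, max_interactions=budget))
```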


Case Study

Model Output

Comparison of model outputs from WebAgent-R1 and WebAgent-R1-CoT. We present successful trajectories from both models on the same task ("What are the top-3 best-selling products in Jan 2023?"), showing only the first two steps for clarity; the entire trajectory is shown in the subsequent web screenshots for full context.

Compared to WebAgent-R1, the long-CoT variant WebAgent-R1-CoT exhibits a more detailed thinking process.


Real-world Example

A real-world example of a successful trajectory generated by WebAgent-R1 on the task: What are the top-3 best-selling products in Jan 2023?


About

[EMNLP 2025] WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
