This is the official repository for the survey paper *The Landscape of Agentic Reinforcement Learning for LLMs: A Survey*.
ArXiv – https://arxiv.org/abs/2509.02547
HuggingFace – https://huggingface.co/papers/2509.02547
```bibtex
@misc{zhang2025landscapeagenticreinforcementlearning,
      title={The Landscape of Agentic Reinforcement Learning for LLMs: A Survey},
      author={Guibin Zhang and Hejia Geng and Xiaohang Yu and Zhenfei Yin and Zaibin Zhang and Zelin Tan and Heng Zhou and Zhongzhi Li and Xiangyuan Xue and Yijiang Li and Yifan Zhou and Yang Chen and Chen Zhang and Yutao Fan and Zihu Wang and Songtao Huang and Yue Liao and Hongru Wang and Mengyue Yang and Heng Ji and Michael Littman and Jun Wang and Shuicheng Yan and Philip Torr and Lei Bai},
      year={2025},
      eprint={2509.02547},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.02547},
}
```

“Clip” denotes whether the objective clips the policy ratio to keep it from moving too far from 1, ensuring stable updates.
“KL Penalty” denotes whether the objective penalizes the KL divergence between the learned policy and a reference policy, keeping the two aligned.
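For concreteness, the sketch below shows how these two columns typically enter a PPO-style loss. It is our own illustration, not code from any paper listed; in particular, the simple k1 KL estimator and the function name are assumptions.

```python
import torch

def ppo_style_loss(logp_new, logp_old, logp_ref, advantages,
                   clip_eps=0.2, kl_coef=0.1):
    """Illustrative clipped surrogate loss with a KL penalty.

    All inputs are log-probability tensors of the same shape.
    """
    # "Clip": keep the policy ratio r = pi_new / pi_old near 1 for stable updates.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # "KL Penalty": penalize divergence from the reference policy
    # (simple k1 estimator of KL(pi_new || pi_ref)).
    kl_penalty = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl_penalty
```

Methods marked Adaptive or Scaled in the KL Penalty column vary `kl_coef` (or its analogue) during training rather than holding it fixed.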
| Method | Year | Objective Type | Clip | KL Penalty | Key Mechanism | Signal | Link | Resource |
|---|---|---|---|---|---|---|---|---|
| PPO family | ||||||||
| PPO | 2017 | Policy gradient | Yes | No | Policy ratio clipping | Reward | Paper | - |
| VAPO | 2025 | Policy gradient | Yes | Adaptive | Adaptive KL penalty + variance control | Reward + variance signal | Paper | - |
| PF-PPO | 2024 | Policy gradient | Yes | Yes | Policy filtration | Noisy reward | Paper | Code |
| VinePPO | 2024 | Policy gradient | Yes | Yes | Unbiased value estimates | Reward | Paper | Code |
| PSGPO | 2024 | Policy gradient | Yes | Yes | Process supervision | Process Reward | Paper | - |
| DPO family | ||||||||
| DPO | 2023 | Preference optimization | No | Yes | Implicit reward derived from the policy | Human preference | Paper | - |
| β-DPO | 2024 | Preference optimization | No | Adaptive | Dynamic KL coefficient | Human preference | Paper | Code |
| SimPO | 2024 | Preference optimization | No | Scaled | Use avg log-prob of a sequence as implicit reward | Human preference | Paper | Code |
| IPO | 2024 | Implicit preference | No | No | LLMs as preference classifiers | Preference rank | Paper | - |
| KTO | 2024 | Knowledge transfer optimization | No | Yes | Teacher stabilization | Teacher-student logit | Paper | Code Model |
| ORPO | 2024 | Online regularized preference optimization | No | Yes | Online stabilization | Online feedback reward | Paper | Code Model |
| Step-DPO | 2024 | Preference optimization | No | Yes | Step-wise supervision | Step-wise preference | Paper | Code Model |
| LCPO | 2025 | Preference optimization | No | Yes | Length preference with limited data/training | Reward | Paper | - |
| GRPO family | ||||||||
| GRPO | 2025 | Policy gradient under group-based reward | Yes | Yes | Group-based relative reward to eliminate value estimates | Group-based reward | Paper | - |
| DAPO | 2025 | Surrogate of the GRPO objective | Yes | Yes | Decoupled clipping + dynamic sampling | Dynamic group-based reward | Paper | Code Model Website |
| GSPO | 2025 | Surrogate of the GRPO objective | Yes | Yes | Sequence-level clipping, rewarding, and optimization | Smooth group-based reward | Paper | - |
| GMPO | 2025 | Surrogate of the GRPO objective | Yes | Yes | Geometric mean of token-level rewards | Margin-based reward | Paper | Code |
| ProRL | 2025 | Same as GRPO | Yes | Yes | Reference policy reset | Group-based reward | Paper | Model |
| Posterior-GRPO | 2025 | Same as GRPO | Yes | Yes | Rewards only successful processes | Process-based reward | Paper | - |
| Dr.GRPO | 2025 | Unbiased GRPO objective | Yes | Yes | Eliminates bias in optimization | Group-based reward | Paper | Code Model |
| Step-GRPO | 2025 | Same as GRPO | Yes | Yes | Rule-based reasoning rewards | Step-wise reward | Paper | Code Model |
| SRPO | 2025 | Same as GRPO | Yes | Yes | Two-stage history resampling | Reward | Paper | Model |
| GRESO | 2025 | Same as GRPO | Yes | Yes | Pre-rollout filtering | Reward | Paper | Code Website |
| StarPO | 2025 | Same as GRPO | Yes | Yes | Reasoning-guided actions for multi-turn interactions | Group-based reward | Paper | Code Website |
| GHPO | 2025 | Policy gradient | Yes | Yes | Adaptive prompt refinement | Reward | Paper | Code |
| Skywork R1V2 | 2025 | GRPO with hybrid reward signal | Yes | Yes | Selective sample buffer | Multimodal reward | Paper | Code Model |
| ASPO | 2025 | GRPO with shaped advantage | Yes | Yes | Clipped bias added to the advantage | Group-based reward | Paper | Code Model |
| TreePO | 2025 | Same as GRPO | Yes | Yes | Self-guided rollout with reduced compute burden | Group-based reward | Paper | Code Model Website |
| EDGE-GRPO | 2025 | Same as GRPO | Yes | Yes | Entropy-driven advantage + error correction | Group-based reward | Paper | Code Model |
| DARS | 2025 | Same as GRPO | Yes | No | Multi-stage rollout for the hardest problems | Group-based reward | Paper | Code Model |
| CHORD | 2025 | Weighted GRPO + SFT | Yes | Yes | Auxiliary supervised loss | Group-based reward | Paper | Code |
| PAPO | 2025 | Surrogate of the GRPO objective | Yes | Yes | Implicit Perception Loss | Group-based reward | Paper | Code Model Website |
| Pass@k Training | 2025 | Same as GRPO | Yes | Yes | Pass@k metric as reward | Group-based reward | Paper | Code |
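Most GRPO-family methods above share one core mechanism: instead of a learned critic, the advantage of each rollout is computed relative to a group of G rollouts sampled for the same prompt. A minimal sketch of that computation, in our own illustrative code rather than any listed paper's:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-based relative advantages in the style of GRPO.

    rewards: shape [num_prompts, G], one scalar reward per sampled completion.
    Each reward is normalized against its own group's statistics, so no
    learned value function (critic) is required.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

Variants in the table tweak this recipe; Dr.GRPO, for example, argues that the per-group standard-deviation division biases optimization and removes it.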
| Method | Category | Base LLM | Link | Resource |
|---|---|---|---|---|
| Open Source Methods | ||||
| DeepRetrieval | External | Qwen2.5-3B-Instruct, Llama-3.2-3B-Instruct | Paper | Code |
| Search-R1 | External | Qwen2.5-3B/7B-Base/Instruct | Paper | Code |
| R1-Searcher | External | Qwen2.5-7B, Llama3.1-8B-Instruct | Paper | Code |
| R1-Searcher++ | External | Qwen2.5-7B-Instruct | Paper | Code |
| ReSearch | External | Qwen2.5-7B/32B-Instruct | Paper | Code |
| StepSearch | External | Qwen2.5-3B/7B-Base/Instruct | Paper | Code |
| WebDancer | External | Qwen2.5-7B/32B, QwQ-32B | Paper | Code |
| WebThinker | External | QwQ-32B, DeepSeek-R1-Distill-Qwen-7B/14B/32B, Qwen2.5-32B-Instruct | Paper | Code |
| WebSailor | External | Qwen2.5-3B/7B/32B/72B | Paper | Code |
| WebWatcher | External | Qwen2.5-VL-7B/32B | Paper | Code |
| ASearcher | External | Qwen2.5-7B/14B, QwQ-32B | Paper | Code |
| ZeroSearch | Internal | Qwen2.5-3B/7B-Base/Instruct | Paper | Code |
| SSRL | Internal | Qwen2.5-1.5B/3B/7B/14B/32B/72B-Instruct, Llama-3.2-1B/8B-Instruct, Llama-3.1-8B/70B-Instruct, Qwen3-0.6B/1.7B/4B/8B/14B/32B | Paper | Code |
| Closed Source Methods | ||||
| OpenAI Deep Research | External | OpenAI Models | Blog | Website |
| Perplexity Deep Research | External | - | Blog | Website |
| Gemini Deep Research | External | Gemini | Blog | Website |
| Kimi-Researcher | External | Kimi K2 | Blog | Website |
| Grok DeepSearch | External | Grok 3 | Blog | Website |
| Doubao with Deep Think | External | Doubao | Blog | Website |
| Method | RL Reward Type | Base LLM | Link | Resource |
|---|---|---|---|---|
| RL for Code Generation | ||||
| AceCoder | Outcome | Qwen2.5-Coder-7B-Base/Instruct, Qwen2.5-7B-Instruct | Paper | Code |
| DeepCoder-14B | Outcome | DeepSeek-R1-Distill-Qwen-14B | Blog | Code |
| RLTF | Outcome | CodeGen-NL 2.7B, CodeT5 | Paper | Code |
| CURE | Outcome | Qwen2.5-7B/14B-Instruct, Qwen3-4B | Paper | Code |
| Absolute Zero | Outcome | Qwen2.5-7B/14B, Qwen2.5-Coder-3B/7B/14B, Llama-3.1-8B | Paper | Code |
| StepCoder | Process | DeepSeek-Coder-Instruct-6.7B | Paper | Code |
| Process Supervision-Guided PO | Process | - | Paper | - |
| CodeBoost | Process | Qwen2.5-Coder-7B-Instruct, Llama-3.1-8B-Instruct, Seed-Coder-8B-Instruct, Yi-Coder-9B-Chat | Paper | Code |
| PRLCoder | Process | CodeT5+, UniXcoder, T5-base | Paper | - |
| o1-Coder | Process | DeepSeek-1.3B-Instruct | Paper | Code |
| CodeFavor | Process | Mistral-NeMo-12B-Instruct, Gemma-2-9B-Instruct, Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3 | Paper | Code |
| Focused-DPO | Process | DeepSeek-Coder-6.7B-Base/Instruct, Magicoder-S-DS-6.7B, Qwen2.5-Coder-7B-Instruct | Paper | - |
| RL for Iterative Code Refinement | ||||
| RLEF | Outcome | Llama-3.0-8B-Instruct, Llama-3.1-8B/70B-Instruct | Paper | - |
| μCode | Outcome | Llama-3.2-1B/8B-Instruct | Paper | Code |
| R1-Code-Interpreter | Outcome | Qwen2.5-7B/14B-Instruct-1M, Qwen2.5-3B-Instruct | Paper | Code |
| IterPref | Process | Deepseek-Coder-7B-Instruct, Qwen2.5-Coder-7B, StarCoder2-15B | Paper | - |
| LeDex | Process | StarCoder-15B, CodeLlama-7B/13B | Paper | - |
| CTRL | Process | Qwen2.5-Coder-7B/14B/32B-Instruct | Paper | Code |
| ReVeal | Process | DAPO-Qwen-32B, Qwen2.5-32B-Instruct | Paper | - |
| Posterior-GRPO | Process | Qwen2.5-Coder-3B/7B-Base, Qwen2.5-Math-7B | Paper | - |
| Policy Filtration for RLHF | Process | DeepSeek-Coder-6.7B, Qwen1.5-7B | Paper | Code |
| RL for Automated Software Engineering (SWE) | ||||
| DeepSWE | Outcome | Qwen3-32B | Blog | Code |
| SWE-RL | Outcome | Llama-3.3-70B-Instruct | Paper | Code |
| Satori-SWE | Outcome | Qwen-2.5-Math-7B | Paper | Code |
| RLCoder | Outcome | CodeLlama-7B, StarCoder-7B, StarCoder2-7B, DeepSeekCoder-1B/7B | Paper | Code |
| Qwen3-Coder | Outcome | - | Paper | Code |
| ML-Agent | Outcome | Qwen2.5-7B-Base/Instruct, DeepSeek-R1-Distill-Qwen-7B | Paper | Code |
| Golubev et al. | Process | Qwen2.5-72B-Instruct | Paper | - |
| SWEET-RL | Process | Llama-3.1-8B/70B-Instruct | Paper | Code |
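In the table above, "Outcome" rewards score only the final artifact, most commonly by executing generated code against unit tests, while "Process" rewards also score intermediate steps. A minimal sketch of a binary outcome reward follows; it is our own illustration (the function name is an assumption), and real pipelines run the code in a sandbox rather than directly:

```python
import os
import subprocess
import tempfile

def outcome_reward(solution_code: str, test_code: str, timeout: float = 5.0) -> float:
    """Return 1.0 if the candidate solution passes its unit tests, else 0.0."""
    # Concatenate the candidate solution and its unit tests into one script.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        # Reward 1.0 iff the tests run to completion without error.
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.remove(path)
```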
| Method | Reward | Link | Resource |
|---|---|---|---|
| RL for Informal Mathematical Reasoning | |||
| ARTIST | Outcome | Paper | - |
| ToRL | Outcome | Paper | Code Model |
| ZeroTIR | Outcome | Paper | Code Model |
| TTRL | Outcome | Paper | Code |
| RENT | Outcome | Paper | Code Website |
| Satori | Outcome | Paper | Code Model Website |
| 1-shot RLVR | Outcome | Paper | Code Model |
| Prover-Verifier Games (legibility) | Outcome | Paper | - |
| rStar2-Agent | Outcome | Paper | Code |
| START | Process | Paper | - |
| LADDER | Process | Paper | - |
| SWiRL | Process | Paper | - |
| RLoT | Process | Paper | Code |
| RL for Formal Mathematical Reasoning | |||
| DeepSeek-Prover-v1.5 | Outcome | Paper | Code Model |
| Leanabell-Prover | Outcome | Paper | Code Model |
| Kimina-Prover (Preview) | Outcome | Paper | Code Model |
| Seed-Prover | Outcome | Paper | Code |
| DeepSeek-Prover-v2 | Process | Paper | Code Model |
| ProofNet++ | Process | Paper | - |
| Leanabell-Prover-v2 | Process | Paper | Code |
| Hybrid | |||
| InternLM2.5-StepProver | Hybrid | Paper | Code |
| Lean-STaR | Hybrid | Paper | Code Model Website |
| STP | Hybrid | Paper | Code Model |
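For the informal-reasoning methods above, an "Outcome" reward typically checks only the final answer, e.g., by extracting the last `\boxed{...}` expression and comparing it to the reference. A minimal sketch, in our own illustrative code; production verifiers also normalize mathematically equivalent forms rather than doing exact string matching:

```python
import re

def boxed_answer_reward(completion: str, gold_answer: str) -> float:
    """Outcome reward: 1.0 iff the final \\boxed{...} answer matches the reference."""
    # Take the last boxed expression as the model's final answer
    # (this simple pattern does not handle nested braces).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == gold_answer.strip() else 0.0
```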
| Method | Paradigm | Environment | Link | Resource |
|---|---|---|---|---|
| Non-RL GUI Agents | ||||
| MM-Navigator | Vanilla VLM | - | Paper | Code |
| SeeAct | Vanilla VLM | - | Paper | Code |
| TRISHUL | Vanilla VLM | - | Paper | - |
| InfiGUIAgent | SFT | - | Paper | Code Model Website |
| UI-AGILE | SFT | - | Paper | Code Model |
| TongUI | SFT | - | Paper | Code Model Website |
| RL-based GUI Agents | ||||
| GUI-R1 | RL | Static | Paper | Code Model |
| UI-R1 | RL | Static | Paper | Code Model |
| InfiGUI-R1 | RL | Static | Paper | Code Model |
| AgentCPM | RL | Static | Paper | Code Model |
| WebAgent-R1 | RL | Interactive | Paper | - |
| Vattikonda et al. | RL | Interactive | Paper | - |
| UI-TARS | RL | Interactive | Paper | Code Model Website |
| DiGiRL | RL | Interactive | Paper | Code Model Website |
| ZeroGUI | RL | Interactive | Paper | Code |
| MobileGUI-RL | RL | Interactive | Paper | - |
TO BE ADDED
TO BE ADDED
“Dynamic” denotes whether the multi-agent system is task-dynamic, i.e., whether it processes different task queries with different configurations (agent count, topology, reasoning depth, prompts, etc.).
“Train” denotes whether the method trains the LLM backbone of the agents.
| Method | Dynamic | Train | RL Algorithm | Link | Resource |
|---|---|---|---|---|---|
| RL-Free Multi-Agent Systems (not exhaustive) | |||||
| CAMEL | ✗ | ✗ | - | Paper | Code Model |
| MetaGPT | ✗ | ✗ | - | Paper | Code |
| MAD | ✗ | ✗ | - | Paper | Code |
| MoA | ✗ | ✗ | - | Paper | Code |
| AFlow | ✗ | ✗ | - | Paper | Code |
| RL-Based Multi-Agent Training | |||||
| GPTSwarm | ✗ | ✗ | policy gradient | Paper | Code Website |
| MaAS | ✓ | ✗ | policy gradient | Paper | Code |
| G-Designer | ✓ | ✗ | policy gradient | Paper | Code |
| MALT | ✗ | ✓ | DPO | Paper | - |
| MARFT | ✗ | ✓ | MARFT | Paper | Code |
| MAPoRL | ✓ | ✓ | PPO | Paper | Code |
| MLPO | ✓ | ✓ | MLPO | Paper | - |
| ReMA | ✓ | ✓ | MAMRP | Paper | Code |
| FlowReasoner | ✓ | ✓ | GRPO | Paper | Code |
| LERO | ✓ | ✓ | MLPO | Paper | - |
| CURE | ✗ | ✓ | rule-based RL | Paper | Code Model |
| MMedAgent-RL | ✗ | ✓ | GRPO | Paper | - |
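To make the “Dynamic” column concrete: a task-dynamic system selects a different configuration per task query, while a static one fixes a single configuration. A hypothetical sketch of the knobs involved, with field names drawn from the definition above (they are our own, not from any listed method):

```python
from dataclasses import dataclass, field

@dataclass
class MASConfig:
    """Per-query knobs a task-dynamic multi-agent system may vary (illustrative)."""
    num_agents: int = 3
    topology: str = "chain"          # e.g., "chain", "star", "debate graph"
    reasoning_depth: int = 2
    prompts: list = field(default_factory=list)

# Dynamic ✓: choose a new MASConfig for each incoming task query.
# Dynamic ✗: reuse one fixed MASConfig for every query.
```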
TO BE ADDED
The agent capabilities are denoted by:
① Reasoning, ② Planning, ③ Tool Use, ④ Memory, ⑤ Collaboration, ⑥ Self-Improve.
| Environment / Benchmark | Agent Capability | Task Domain | Modality | Link | Resource |
|---|---|---|---|---|---|
| LMRL-Gym | ①, ④ | Interaction | Text | Paper | Code |
| ALFWorld | ②, ① | Embodied, Text Games | Text | Paper | Code Website |
| TextWorld | ②, ① | Text Games | Text | Paper | Code |
| ScienceWorld | ①, ② | Embodied, Science | Text | Paper | Code Website |
| AgentGym | ①, ④ | Text Games | Text | Paper | Code Website |
| AgentBench | ① | General | Text, Visual | Paper | Code |
| InternBootcamp | ① | General, Coding, Logic | Text | Paper | Code |
| LoCoMo | ④ | Interaction | Text | Paper | Code Website |
| MemoryAgentBench | ④ | Interaction | Text | Paper | Code |
| WebShop | ②, ③ | Web | Text | Paper | Code Website |
| Mind2Web | ②, ③ | Web | Text, Visual | Paper | Code Website |
| WebArena | ②, ③ | Web | Text | Paper | Code Website |
| VisualWebArena | ①, ②, ③ | Web | Text, Visual | Paper | Code Website |
| AppWorld | ②, ③ | App | Text | Paper | Code Website |
| AndroidWorld | ②, ③ | GUI, App | Text, Visual | Paper | Code |
| OSWorld | ②, ③ | GUI, OS | Text, Visual | Paper | Code Website |
| Debug-Gym | ①, ③ | SWE | Text | Paper | Code Website |
| MLE-Dojo | ②, ① | MLE | Text | Paper | Code Website |
| τ-bench | ①, ③ | Interaction | Text | Paper | Code |
| TheAgentCompany | ②, ③, ⑤ | SWE | Text | Paper | Code Website |
| MedAgentGym | ① | Science | Text | Paper | Code |
| SecRepoBench | ①, ③ | Coding, Security | Text | Paper | - |
| R2E-Gym | ①, ② | SWE | Text | Paper | Code Website |
| HumanEval | ① | Coding | Text | Paper | Code |
| MBPP | ① | Coding | Text | Paper | Code |
| BigCodeBench | ① | Coding | Text | Paper | Code Website |
| LiveCodeBench | ① | Coding | Text | Paper | Code Website |
| SWE-bench | ①, ③ | SWE | Text | Paper | Code Website |
| SWE-rebench | ①, ③ | SWE | Text | Paper | Website |
| DevBench | ②, ① | SWE | Text | Paper | Code |
| ProjectEval | ②, ① | SWE | Text | Paper | Code Website |
| DA-Code | ①, ③ | Data Science, SWE | Text | Paper | Code Website |
| ColBench | ②, ①, ③ | SWE, Web Dev | Text | Paper | Code Website |
| NoCode-bench | ②, ① | SWE | Text | Paper | Code Website |
| MLE-Bench | ②, ①, ③ | MLE | Text | Paper | Code Website |
| PaperBench | ②, ①, ③ | MLE | Text | Paper | Code Website |
| Crafter | ②, ④ | Game | Visual | Paper | Code Website |
| Craftax | ②, ④ | Game | Visual | Paper | Code |
| ELLM (Crafter variant) | ②, ① | Game | Visual | Paper | Code Website |
| SMAC / SMAC-Exp | ⑤, ② | Game | Visual | Paper | Code |
| Factorio | ②, ① | Game | Visual | Paper | Code Website |
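Despite their different domains, most environments above expose some variant of a reset/step loop. A hypothetical gym-style episode is sketched below; the `env` and `agent` interfaces are assumptions for illustration, and the exact APIs differ per benchmark:

```python
def run_episode(env, agent, max_steps=50):
    """Run one episode against a gym-style environment and return total reward."""
    observation = env.reset()        # task description / initial page or scene
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)                 # LLM decides the next action
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```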
| Framework | Type | Key Features | Link | Resource |
|---|---|---|---|---|
| Agentic RL Frameworks | ||||
| Verifiers | Agent RL / LLM RL | Verifiable environment setup | - | Code |
| SkyRL-v0/v0.1 | Agent RL | Long-horizon real-world training | Blog (v0) Blog (v0.1) | Code |
| AREAL | Agent RL / LLM RL | Asynchronous training | Paper | Code |
| MARTI | Multi-agent RL / LLM RL | Integrated multi-agent training | - | Code |
| EasyR1 | Agent RL / LLM RL | Multimodal support | - | Code |
| AgentFly | Agent RL | Scalable asynchronous execution | Paper | Code |
| Agent Lightning | Agent RL | Decoupled hierarchical RL | Paper | Code |
| RLHF and LLM Fine-tuning Frameworks | ||||
| OpenRLHF | RLHF / LLM RL | High-performance scalable RLHF | Paper | Code |
| TRL | RLHF / LLM RL | Hugging Face RLHF | - | Code |
| trlX | RLHF / LLM RL | Distributed large-model RLHF | Paper | Code |
| HybridFlow | RLHF / LLM RL | Streamlined experiment management | Paper | Code |
| slime | RLHF / LLM RL | High-performance async RL | - | Code |
| General-purpose RL Frameworks | ||||
| RLlib | General RL / Multi-agent RL | Production-grade scalable library | Paper | Code |
| Acme | General RL | Modular distributed components | Paper | Code |
| Tianshou | General RL | High-performance PyTorch platform | Paper | Code |
| Stable Baselines3 | General RL | Reliable PyTorch algorithms | Paper | Code |
| PFRL | General RL | Benchmarked prototyping algorithms | Paper | Code |