Large Language Models (LLMs) excel at natural language understanding and generation, yet their reliance on static pre-training corpora leaves them prone to outdated knowledge, hallucinations, and limited adaptability. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs in external retrieval, but conventional RAG remains constrained by a fixed retrieve-then-generate routine and struggles with multi-step reasoning and tool calls. Agentic RAG addresses these limitations by enabling LLM agents to actively decompose tasks, issue exploratory queries, and refine evidence through iterative retrieval. Despite growing interest, the development of Agentic RAG is impeded by data scarcity: unlike traditional RAG, it demands challenging tasks that call for planning, retrieval, and multiple reasoning decisions, together with the corresponding rich, interactive agent trajectories. This survey presents the first data-centric overview of Agentic RAG, framing its data lifecycle (data collecting, data preprocessing and task formulation, task construction, data for evaluation, and data enhancement for training) and cataloging representative systems and datasets across domains (e.g., question answering, web, software engineering). From a data perspective, we aim to guide the creation of scalable, high-quality datasets for the next generation of adaptive, knowledge-seeking LLM agents.
Large Language Models (LLMs) have greatly advanced AI with strong natural language understanding and generation.
Yet their dependence on static pre-training data leads to outdated facts, hallucinations, and limited adaptability to fast-changing information. Retrieval-Augmented Generation (RAG) mitigates these issues by retrieving real-time knowledge from external databases, APIs, or the web to ground generation.
Nevertheless, traditional RAG follows a fixed retrieve-then-generate routine and struggles with multi-step reasoning or iterative retrieval.
Recent developments in agentic AI introduce autonomous LLM-based agents that can plan, reflect, and coordinate tool use.
Combining this paradigm with RAG yields Agentic RAG, where agents actively drive retrieval, assess evidence, and refine outputs through iterative interaction.
Unlike traditional RAG, these RAG-reasoning agents perform active knowledge seeking: decomposing tasks, issuing exploratory queries to multiple sub-agents, and looping retrieval until sufficient information is obtained.
Despite growing interest, Agentic RAG development is hindered by data scarcity.
Unlike traditional RAG, where static corpora suffice, Agentic RAG demands challenging tasks that call for planning, retrieval, and multiple reasoning decisions, together with the corresponding rich, interactive agent trajectories.
| Stage | Traditional RAG | Agentic RAG |
|---|---|---|
| Data Collection | Static data (e.g., Wikipedia, ArXiv) | Interactive data (e.g., tool/API usage, web navigation) |
| Task Construction | Basic tasks (single-step, solvable with direct retrieval) | Hard tasks (requiring decomposition, different tools, and reasoning) |
| Evaluation Metrics | Correctness | Multiple axes (e.g., correctness, efficiency, safety) |
| Data for Training | Chain-of-Thought | Thought–action trajectories, preference pairs, process rewards, new data generated during training for self-improvement |
Table 1. Comparison of traditional RAG and Agentic RAG in data lifecycle.
Such data are costly to annotate, difficult to scale, and prone to quality issues when automatically synthesized. Therefore, curating scalable and high-quality datasets and benchmarks has been a central problem in the development of Agentic RAG systems.
The data curation process in Agentic RAG has two distinctive aspects:
- Traditional RAG vs. Agentic RAG: traditional RAG relies on query–document pairs, whereas Agentic RAG demands rich agent–environment interaction traces encoding planning and retrieval actions.
- Agentic RAG vs. general agents: general agents often use tools such as calculators or code interpreters for problem solving, whereas Agentic RAG uses search engines and knowledge bases for knowledge seeking. In the former case, a tool call returns a definitive result; in Agentic RAG, a tool call typically returns additional information that the agent must further assess and integrate.
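This contrast can be made concrete with two minimal record schemas. The field names below are illustrative assumptions, not taken from any specific dataset:

```python
from dataclasses import dataclass, field

@dataclass
class RAGExample:
    """Traditional RAG: a static query-document-answer record."""
    query: str
    documents: list[str]  # retrieved once, before generation
    answer: str

@dataclass
class AgentStep:
    """One turn of agent-environment interaction."""
    thought: str       # planning / reflection
    action: str        # e.g. a search query or tool call (illustrative)
    observation: str   # evidence returned by the environment

@dataclass
class AgenticTrajectory:
    """Agentic RAG: an interleaved thought-action-observation trace."""
    task: str
    steps: list[AgentStep] = field(default_factory=list)
    final_answer: str = ""
```

A traditional-RAG example collapses into one retrieval call, while an agentic trajectory records every intermediate decision — which is precisely the data that is costly to collect and scale.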
This survey frames Agentic RAG through a data lifecycle that spans data collecting, data preprocessing and task formulation, task construction, data for evaluation, and data enhancement for training. Specifically, we adopt a generate-verify-filter/refine pipeline to analyze the curation process of tasks and trajectories.
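The generate-verify-filter/refine pipeline above can be sketched as a short loop; `generate_task`, `verify`, and `refine` are hypothetical stand-ins for whatever generator (template- or LLM-based) and validators (rule-based, human, or LLM-as-a-judge) a concrete system plugs in:

```python
def curate(generate_task, verify, refine, n_target, max_refines=2):
    """Generic generate -> verify -> filter/refine curation loop (sketch).

    generate_task() -> a candidate task (e.g. a question-answer pair)
    verify(task)    -> True if the task passes quality checks
    refine(task)    -> a repaired version of a failing task
    """
    kept = []
    while len(kept) < n_target:
        task = generate_task()
        for _ in range(max_refines):
            if verify(task):
                kept.append(task)
                break
            task = refine(task)  # try to repair; unrepairable tasks are filtered out
    return kept
```

Systems differ mainly in which stage they emphasize: synthesis-heavy pipelines invest in the generator, while benchmark curation invests in verification.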
- Data Collecting
- Data Preprocessing and Task Formulation
- Task Construction: Annotation and Synthesis
- Data for Evaluation
- Data Enhancement for Training
- Wikipedia
- (TACL 2019) Natural Questions: A Benchmark for Question Answering Research [Paper] [Code]
- (EMNLP 2018) HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering [Paper] [Code]
- (COLING 2020) Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps (2WikiMultihopQA) [Paper] [Code]
- GitHub repositories
- Kaggle competitions
- API-based retrieval
- Web navigation
- (EMNLP 2025) LightRAG: Simple and Fast Retrieval-Augmented Generation [Paper] [Code] (relation schemas)
- T-GRAG: A Dynamic GraphRAG Framework for Resolving Temporal Conflicts and Redundancy in Knowledge Retrieval [Paper] [Code] (chronological structure)
- Close-ended
- Real-world workflows
- Creative (Academic Writing)
- (NeurIPS 2024) AutoSurvey: Large Language Models Can Automatically Write Surveys [Paper] [Code]
- (ACL 2025) SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing [Paper] [Code]
- SurveyX: Academic Survey Automation via Large Language Models [Paper] [Code]
- Agent Laboratory: Using LLM Agents as Research Assistants [Paper] [Code]
- Crowdsourced
- (TACL 2019) Natural Questions: A Benchmark for Question Answering Research [Paper] [Code]
- (EMNLP 2018) HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering [Paper] [Code]
- Measuring short-form factuality in large language models (SimpleQA) [Paper] [Code]
- BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [Paper] [Code]
- (ICLR 2024) GAIA: a benchmark for General AI Assistants [Paper] [Dataset]
- Ready-made tasks on the Internet
- (ACL 2017) TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension [Paper] [Code]
- (ICLR 2024) SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [Paper] [Code]
- (ICLR 2025) MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering [Paper] [Code]
- Synthetic
- (COLING 2020) Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps (2WikiMultihopQA) [Paper] [Code]
- (ACL 2024) INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning [Paper] [Code]
- (NeurIPS 2024) Gorilla: Large Language Model Connected with Massive APIs [Paper] [Code]
- WebDancer: Towards Autonomous Information Seeking Agency [Paper] [Code]
- Complexity
- (EMNLP 2018) HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering [Paper] [Code]
(multi-hop)
- (COLING 2020) Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps (2WikiMultihopQA) [Paper] [Code]
(multi-hop)
- (TACL 2022) MuSiQue: Multihop Questions via Single-hop Question Composition [Paper] [Code]
(multi-hop)
- TaskCraft: Automated Generation of Agentic Tasks [Paper] [Code]
(multi-hop)
- WebDancer: Towards Autonomous Information Seeking Agency [Paper] [Code]
(multi-hop)
- (ACL 2024) On the Multi-turn Instruction Following for Conversational Web Agents [Paper] [Code]
(multi-turn conversations)
- (ACL 2025) WebWalker: Benchmarking LLMs in Web Traversal [Paper] [Code]
(multiple webpages)
- (ICLR 2024) SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [Paper] [Code]
(repo-level coding)
- (ICLR 2024) RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems [Paper] [Code]
(repo-level coding)
- (ICLR 2024) GAIA: a benchmark for General AI Assistants [Paper] [Dataset] (multiple tools)
- Uncertainty
- (TACL 2021) Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies (StrategyQA) [Paper] [Code]
(implicit reasoning tasks)
- (COLING 2020) Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps (2WikiMultihopQA) [Paper] [Code]
(distractors in reference documents)
- (TACL 2022) MuSiQue: Multihop Questions via Single-hop Question Composition [Paper] [Code]
(distractors in reference documents, unanswerable questions)
- WebSailor: Navigating Super-human Reasoning for Web Agent [Paper] [Code]
(obfuscate key information)
- BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [Paper] [Code]
(inverted problems)
- Expertise
- Human-based (inter-annotator agreement)
- LLM-based
- QA
- Code
- (ACL 2025) WebWalker: Benchmarking LLMs in Web Traversal [Paper] [Code] (linguistic naturalness)
- A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains (Amazon-bench) [Paper] (linguistic naturalness)
- (EMNLP 2020) Is Multihop QA in DIRE Condition? Measuring and Reducing Disconnected Reasoning [Paper] [Code] (no data leakage or exploitable shortcuts)
- (TACL 2022) MuSiQue: Multihop Questions via Single-hop Question Composition [Paper] [Code] (no data leakage or exploitable shortcuts)
- Agent Laboratory: Using LLM Agents as Research Assistants [Paper] [Code] (source credibility)
- Rule-based
- TaskCraft: Automated Generation of Agentic Tasks [Paper] [Code]
(number of hops)
- (ACL 2025) WebWalker: Benchmarking LLMs in Web Traversal [Paper] [Code]
(number of hops)
- (ICLR 2024) GAIA: a benchmark for General AI Assistants [Paper] [Dataset] (number of tools)
- (TACL 2022) MuSiQue: Multihop Questions via Single-hop Question Composition [Paper] [Code]
(with or without unanswerable questions)
- (COLM 2024) GPQA: A Graduate-Level Google-Proof Q&A Benchmark [Paper] [Code]
(accuracy of experts and non-experts)
- LLM-based (LLM's success rate as proxy)
- (TACL 2022) MuSiQue: Multihop Questions via Single-hop Question Composition [Paper] [Code] (filter out multi-hop questions in the test split with any identical single-hop component in the train split)
- (ICLR 2024) GAIA: a benchmark for General AI Assistants [Paper] [Dataset] (question does not exist on the internet in plain text)
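A common pattern behind these LLM-based verifiers is to use a strong model's success rate as a difficulty proxy and keep only tasks the model fails often enough (SimpleQA's "wrong at least once in four attempts" rule is one instance). A minimal sketch, where `ask_llm` is a hypothetical callable and exact-match comparison stands in for whatever answer checker a pipeline actually uses:

```python
def is_hard_enough(question, gold_answer, ask_llm, trials=4, max_successes=3):
    """Keep a task only if the reference LLM fails it at least once (sketch).

    ask_llm(question) -> a candidate answer string (hypothetical interface).
    A task answered correctly in all `trials` attempts is considered too easy.
    """
    successes = sum(
        ask_llm(question).strip().lower() == gold_answer.strip().lower()
        for _ in range(trials)
    )
    return successes <= max_successes
```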
For this part, please refer to the Task Formulation section above for the relevant papers.
- Gold-standard answers
- Programmatic validators
- LLM-as-a-judge
- (ACL 2025) WebWalker: Benchmarking LLMs in Web Traversal [Paper] [Code] (efficiency: the action count of successful agentic executions)
- A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains (Amazon-bench) [Paper] (safety: benign failures vs. harmful failures)
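An LLM-as-a-judge check, as listed among the evaluation approaches above, typically wraps the candidate answer and the reference in a rubric prompt and parses a verdict. The prompt wording and the `call_llm` interface below are illustrative assumptions, not taken from any of the cited benchmarks:

```python
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def llm_judge(question, reference, candidate, call_llm):
    """Return True iff the judge model deems the candidate correct (sketch).

    call_llm(prompt) -> model reply string (hypothetical interface).
    """
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    # Parse the one-word verdict; anything else counts as incorrect.
    return reply.strip().upper().startswith("CORRECT")
```

In practice, judge reliability itself has to be validated (e.g. against a human-labeled subset) before its scores are trusted.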
- (NeurIPS 2023) Toolformer: Language Models Can Teach Themselves to Use Tools [Paper] [Code]
(modify pretraining corpora)
- (ACL 2024) INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning [Paper] [Code]
(integrate multiple resources into meta-datasets)
- (NeurIPS 2024) Gorilla: Large Language Model Connected with Massive APIs [Paper] [Code]
(self-instruction and in-context learning)
- Generate
- (NeurIPS 2022) STaR: Bootstrapping Reasoning With Reasoning [Paper] [Code]
(in-context bootstrapping)
- Distilling LLM Agent into Small Models with Retrieval and Code Tools [Paper] [Code]
(trajectory distillation)
- WebSailor: Navigating Super-human Reasoning for Web Agent [Paper] [Code]
(trajectory distillation)
- WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization [Paper] [Code]
(trajectory distillation)
- Filter/Refine
- (ACL 2025 Findings) Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning [Paper] [Code]
(quality influenced by factors such as trajectory granularity, formatting choices, and the teacher model used)
- WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization [Paper] [Code]
(conciseness: filters out trajectories with severe repetition)
- WebSailor: Navigating Super-human Reasoning for Web Agent [Paper] [Code]
(conciseness: reconstructs concise rationales from action–observation sequences)
- Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation [Paper] (conciseness: removes redundant or incorrect reasoning paths)
- (COLM 2025) Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [Paper] [Code]
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [Paper] [Code]
- DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments [Paper] [Code]
- ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning [Paper] [Code]
- WebDancer: Towards Autonomous Information Seeking Agency [Paper] [Code]
- WebSailor: Navigating Super-human Reasoning for Web Agent [Paper] [Code]
- WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization [Paper] [Code]
- (COLM 2025) DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning [Paper] [Code]
(retrieval rewards)
- ReZero: Enhancing LLM search ability by trying one-more-time [Paper] [Code]
(retrieval rewards)
- (NeurIPS 2025) WebThinker: Empowering Large Reasoning Models with Deep Research Capability [Paper] [Code]
(preference pairs based on quality, efficiency, and conciseness)
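The outcome rewards driving these RL methods are typically simple rule-based functions. A hedged sketch combining an exact-match answer reward with a small format bonus — a common pattern in Search-R1-style training, though the tag convention, weights, and matching rule all vary across the papers above:

```python
import re

def outcome_reward(response: str, gold: str) -> float:
    """Rule-based reward for one search-augmented rollout (sketch).

    Assumes the final answer is wrapped in <answer>...</answer> tags,
    a convention used by several of the RL methods above (details vary).
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0                 # malformed rollout: no reward at all
    answer = match.group(1).strip().lower()
    format_bonus = 0.1             # small bonus for following the format
    correctness = 1.0 if answer == gold.strip().lower() else 0.0
    return format_bonus + correctness
```

Retrieval-reward variants (e.g. DeepRetrieval, ReZero) add terms scoring the search behavior itself rather than only the final answer.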
| Name | Task | Source | Scale | Metrics | Data Curating Method |
|---|---|---|---|---|---|
| Question Answering (QA) | |||||
| NQ | Single-hop QA | Google queries, Wikipedia | train 307k, dev 7.8k, test 7.8k | - | Select queries from Google. Search for relevant documents in Wikipedia, and ask annotators to identify answers and filter low-quality questions. |
| TriviaQA | Single-hop QA | Quiz websites, Wikipedia and Internet | train 76.5k, val 10.0k, test 9.5k | - | Select questions from 14 quiz websites. Search for relevant documents in Wikipedia and Internet, and keep those with answers. |
| SimpleQA | Single-hop QA | Crowdsourced | 4326 | - | Annotators create questions with a unique, time-invariant answer. Each question is independently verified by a second annotator. Keep only those that GPT-4 answers incorrectly at least once in four attempts. |
| HotpotQA | Multi-hop QA | Crowdsourced from Wikipedia | train 90.4k, val 7.4k, test 7.4k | - | Build a relation graph from the links in Wikipedia. Choose relevant paragraphs from it, and ask annotators to create multi-hop questions based on the paragraphs and identify supporting facts in them. |
| 2WikiMultihopQA | Multi-hop QA | Synthesized from Wikipedia | train medium 155k, train hard 12.6k, dev 12.6k, test 12.6k | - | Classify the entities in Wikidata. Manually write different question templates, and sample entities to create questions. Filter out questions with no answer or multiple answers. Add distractors in supporting documents. |
| MuSiQue | Multi-hop QA | Synthesized and annotated from Wikipedia | train 39.9k, val 4.8k, test 4.9k | - | Collect Wikipedia-based single-hop questions. Compose 2-hop questions and filter out those with shortcuts. Build different multi-hop question structures and crowdsource questions. Add distractors in supporting documents. Add unanswerable questions. |
| Bamboogle | Multi-hop QA | Manually created from Wikipedia | 125 | - | Create 2-hop questions based on Wikipedia. Keep only those whose correct answer cannot be found by direct search. |
| TaskCraft | Multi-hop QA | Synthesized from different corpora | 36k | - | Generate single-hop questions from different corpora with an LLM. Extend them to multi-hop questions via depth-based and width-based extension. Filter out those with shortcuts. |
| Web | |||||
| WebArena | QA-like & task-oriented web interaction | Custom web environments (shopping, email, forum, map, social media) | 7 environments, 812 tasks | Task success rate | Provide realistic multi-page websites. Annotators design diverse tasks requiring navigation, reasoning and interaction. |
| AgentBench | Open-ended web tasks with tool use | Real-world web APIs and websites | 8 domains, 2000+ tasks | Success rate, human eval | Collect tasks from multiple domains (travel, shopping, QA, etc.). Provide tool APIs and human-verified success criteria. |
| GAIA | Complex open-domain information-seeking | Live web environment | 466 tasks (300 retained answers) | F1 score, factual accuracy | Ask annotators to design multi-step questions requiring reasoning, planning and external search. Include hidden evaluation sets to test real-time retrieval. |
| BrowseComp | Fact-seeking QA over web browsing | Internet (open web), human-crafted QA | 1,266 questions | Exact match | Questions designed so answer is short and verifiable. Human annotators ensure difficulty (not solved by existing models, not in top search results), enforce time/effort thresholds. |
| WebWalkerQA | Multi-hop QA via web navigation | Real Wikipedia + open web | 680 questions | Exact match, F1 score | Generate multi-hop QA pairs requiring active web navigation. Filter with LLM-based difficulty control and human verification. |
| Amazon-Bench | E-commerce | Live Amazon.com webpages | 400 user queries across 7 task types | Task success rate, harmful/benign failure rate, efficiency | Explore and categorize 60k+ Amazon pages. Sample diverse pages by functionality score, then prompt LLMs to generate realistic user queries and refine them to make them sound more natural and user-like. |
| Software Engineering | |||||
| SWE-bench | Generate a pull request (PR) to solve a given issue | GitHub issues from 12 Python repositories | train 19k, test 2294 | Unit test pass rate | Select PRs that resolve an issue and contribute tests. Keep only those that install successfully and pass all tests. |
| RepoBench | Code retrieval, code completion | GitHub-code dataset, GitHub Python and Java repositories | Python 24k, Java 26k | Golden snippet matching, line matching | Randomly sample lines as completion goals (with a first-to-use subset). Extract candidate snippets based on import statements, and annotate golden snippets. |
| DevEval | Repository-level function completion | Popular repositories from PyPI | 1874 | Unit test pass rate, recall of reference dependency | Select functions with test cases from repositories. Ask annotators to write requirements and reference dependencies. Filter out those with no cross-file dependency. |
| Machine Learning | |||||
| MLAgentBench | Improve the performance metric by at least 10% over the baseline in the starter code | Kaggle | 13 | Success rate of 10% improvement, total time and tokens | Manually construct task description, starter code and evaluation code. |
| MLE-bench | Achieve the best score on a metric pre-defined for each competition | Kaggle | 75 | Test score compared on leaderboard (e.g., medals) | Crawl task description, dataset, grading code and leaderboard from the Kaggle website. Keep only those reproducible and up-to-date. Manually label the category and difficulty. |
| Medical | |||||
| MedQA | Four-option multiple-choice question | National Medical Board Examination | train 48.9k, dev 6.1k, test 8.1k | Exact match | Collect question-answer pairs from the National Medical Board Examination. |
| MedMCQA | Four-option multiple-choice QA resembling medical exams | Open websites and books, All India Institute of Medical Sciences, National Eligibility cum Entrance Test | train 18.2k, dev 4.2k, test 6.2k | Exact match | Collect question-answer pairs from medical examinations. Use rule-based methods to preprocess the data. Split the dataset by exam: the training set consists of questions from mock and online exams, while the dev and test sets consist of questions from formal exams. |
| Quilt-VQA | VQA (Visual question answering) | Educational histopathology videos on YouTube | Image-dependent: 1055, General-knowledge: 255 | LLM evaluation | Locate the question marks ("?") in the video transcripts. Extract the relevant text and images. Prompt GPT-4 to generate QA pairs. Perform manual verification. |
| PathVQA | VQA | Electronic pathology textbooks and the Pathology Education Informational Resource Digital Library website | Images: 4998, QA pairs: 32799 | Accuracy (yes/no questions), exact match, macro-averaged F1, BLEU | Extract images and their captions from the data sources. Apply natural language processing to the captions to break long sentences into shorter ones and obtain POS tags. Generate open-ended questions based on POS tags and named entities. |
| PMC-VQA | VQA | PMC-OA | Images: 149k, QA pairs: 227k | BLEU, accuracy | Prompt ChatGPT with the images and captions to generate QA pairs. Perform LLM-based and manual data filtering. |
| PathMMU | VQA | PubMed, EduContent, Atlas, SocialPath, PathCLS | Images: train 16312, val 510, test 7213; QA pairs: train 23041, val 710, test 9677 | - | Extract image-caption pairs from the data source. Prompt GPT-4V to generate detailed description of images and then three questions per image. Perform expert validation. |
| Legal | |||||
| LegalBench | Issue-spotting, rule-recall, rule-application and rule-conclusion, interpretation, rhetorical-understanding | Existing datasets, in-house datasets | 9.1k | Accuracy, human evaluation | Filter and restructure the data from the data sources. |
| LegalBench-RAG | Retrieve snippets from legal corpora | LegalBench, PrivacyQA, CUAD, MAUD, ContractNLI | 6889 | Recall@k, precision@k | Start from LegalBench queries. Trace back each query’s context to its original document span in the corpus. Final dataset pairs each query with its exact evidence. |
Metrics for QA datasets are generally string matching (exact/fuzzy) or F1, and are omitted from the table.
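These string-matching metrics are standard; a minimal sketch of exact match and token-level F1 as typically computed in SQuAD-style evaluation (real harnesses also strip punctuation and articles during normalization, and details vary by dataset):

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase and tokenize; a simplified stand-in for full normalization."""
    return text.lower().split()

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between prediction and gold answer."""
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)     # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```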
- Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG [Paper] [GitHub]
(a general survey on Agentic RAG pipelines and frameworks)
- (EMNLP 2025) Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [Paper] [GitHub]
(the reasoning methods and frameworks in Agentic RAG)
We welcome contributions to expand this collection! To add your work, please:
- Submit a Pull Request or Open an Issue with the following information:
- Paper Title: Your paper's full title
- Paper Link: DOI, arXiv, or conference link
- GitHub Repository: Link to your open-source implementation (if available)
- Category: Specify which stage under our lifecycle your work belongs to:
- Data Collecting: Static Data / Interactive Data
- Data Preprocessing and Task Formulation: Preprocessing / Task Formulation
- Task Construction: Annotation and Synthesis: Generate / Verify / Filter
- Data for Evaluation: Decontamination / Evaluation Metrics and Approaches
- Data Enhancement for Training: SFT / RL
Note that your work may belong to multiple stages; please choose the 1-3 main focuses of your work.
- Format: Follow the existing format in the README for consistency.
- Relevance: Ensure your work is relevant to Agentic RAG data.
Your contributions help build a comprehensive resource for the research community!

