Data-Centric Perspectives on Agentic Retrieval-Augmented Generation: A Survey


Awesome-AgenticRAG-Data


Table of Contents

  1. Abstract
  2. Introduction
  3. Data Lifecycle
  4. Domain-Specific Agentic RAG Benchmarks
  5. Related Surveys

Abstract

Large Language Models (LLMs) excel at natural language understanding and generation, yet their reliance on static pre-training corpora may lead to outdated knowledge, hallucinations, and limited adaptability. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs with external retrieval, but conventional RAG remains constrained by a fixed retrieve–then–generate routine and struggles with multi-step reasoning and tool calls. Agentic RAG addresses these limitations by enabling LLM agents to actively decompose tasks, issue exploratory queries, and refine evidence through iterative retrieval. Despite growing interest, the development of Agentic RAG is impeded by data scarcity: unlike traditional RAG, it requires challenging tasks that demand planning, retrieval, and multiple reasoning decisions, together with correspondingly rich, interactive agent trajectories. This survey presents the first data-centric overview of Agentic RAG, framing its data lifecycle—data collecting, data preprocessing and task formulation, task construction, data for evaluation, and data enhancement for training—and cataloging representative systems and datasets across domains (e.g., question answering, web, software engineering). From a data perspective, we aim to guide the creation of scalable, high-quality datasets for the next generation of adaptive, knowledge-seeking LLM agents.


Introduction

Large Language Models (LLMs) have greatly advanced AI with strong natural language understanding and generation.
Yet their dependence on static pre-training data leads to outdated facts, hallucinations, and limited adaptability to fast-changing information. Retrieval-Augmented Generation (RAG) mitigates these issues by augmenting LLMs with real-time knowledge retrieved from external databases, APIs, or the web to ground generation.
Nevertheless, traditional RAG follows a fixed retrieve–then-generate routine and struggles with multi-step reasoning or iterative retrieval.

Recent developments in agentic AI introduce autonomous LLM-based agents that can plan, reflect, and coordinate tool use.
Combining this paradigm with RAG yields Agentic RAG, where agents actively drive retrieval, assess evidence, and refine outputs through iterative interaction.

Unlike traditional RAG, these RAG-reasoning agents perform active knowledge seeking: decomposing tasks, issuing exploratory queries to multiple sub-agents, and looping retrieval until sufficient information is obtained.
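
This knowledge-seeking loop can be sketched in a few lines of Python; the `decompose` and `search` helpers below are illustrative placeholders (a real agent would use an LLM planner and a retrieval backend):

```python
def decompose(task: str) -> list[str]:
    # Placeholder planner: split a task into sub-queries.
    return [q.strip() for q in task.split(";") if q.strip()]

def search(query: str, corpus: dict[str, str]) -> list[str]:
    # Placeholder retriever: naive keyword match over a toy corpus.
    return [doc for doc in corpus.values() if query.lower() in doc.lower()]

def agentic_rag(task: str, corpus: dict[str, str], max_rounds: int = 3) -> list[str]:
    """Iteratively issue sub-queries and collect evidence until the plan is exhausted."""
    evidence: list[str] = []
    queries = decompose(task)
    for _ in range(max_rounds):
        if not queries:
            break  # plan exhausted: stop looping
        hits = search(queries.pop(0), corpus)
        evidence.extend(hits)
        # A real LLM agent would reflect here and may issue follow-up
        # queries; this sketch simply continues with the remaining plan.
    return evidence
```

The key contrast with a fixed retrieve–then–generate routine is the loop: retrieval is repeated, and each round can be conditioned on what the previous rounds returned.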

Comparison of traditional and Agentic RAG

Despite growing interest, Agentic RAG development is hindered by data scarcity.
Unlike traditional RAG—where static corpora suffice—Agentic RAG requires challenging tasks that demand planning, retrieval, and multiple reasoning decisions, together with correspondingly rich, interactive agent trajectories.

| Stage | Traditional RAG | Agentic RAG |
| --- | --- | --- |
| Data Collection | Static data (e.g., Wikipedia, arXiv) | Interactive data (e.g., tool/API usage, web navigation) |
| Task Construction | Basic tasks (single-step, solvable with direct retrieval) | Hard tasks (requiring decomposition, different tools, and reasoning) |
| Evaluation Metrics | Correctness | Multiple axes (e.g., correctness, efficiency, safety) |
| Data for Training | Chain-of-Thought | Thought–action trajectories, preference pairs, process rewards, new data generated during training for self-improvement |

Table 1. Comparison of traditional RAG and Agentic RAG across the data lifecycle.

Such data are costly to annotate, difficult to scale, and prone to quality issues when automatically synthesized. Therefore, curating scalable and high-quality datasets and benchmarks has been a central problem in the development of Agentic RAG systems.

The data curation process in Agentic RAG has two distinctive aspects:

  • Traditional RAG vs. Agentic RAG: traditional RAG relies on query–document pairs, whereas Agentic RAG demands rich agent–environment interaction traces encoding planning and retrieval actions.
  • Agentic RAG vs. general agents: general agents often use tools such as calculators or code interpreters for problem solving, whereas Agentic RAG uses search engines and knowledge bases for knowledge seeking. In the former case, tools return definitive results; in Agentic RAG, tool calls typically return additional information that the agent must further assess and integrate.
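
The difference in data shape can be made concrete. Below is a hypothetical sketch of the two record types: a flat query–document pair for traditional RAG versus an interaction trace that interleaves thoughts, actions, and observations for Agentic RAG (field names are illustrative, not taken from any specific dataset):

```python
from dataclasses import dataclass, field

# Traditional RAG training example: a flat query-document pair.
@dataclass
class QueryDocPair:
    query: str
    document: str
    answer: str

# Agentic RAG training example: an interaction trace whose steps
# interleave planning thoughts, retrieval actions, and observations.
@dataclass
class Step:
    thought: str       # the agent's reasoning before acting
    action: str        # e.g. 'search("...")' or 'finish("...")'
    observation: str   # what the environment (search engine, KB) returned

@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""
```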

This survey frames Agentic RAG through a data lifecycle that spans data collecting, data preprocessing and task formulation, task construction, data for evaluation, and data enhancement for training. Specifically, we adopt a generate-verify-filter/refine pipeline to analyze the curation process of tasks and trajectories.
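
As a rough sketch, the generate–verify–filter/refine pipeline can be expressed as a small driver loop (the stage functions are abstract placeholders to be filled in per system):

```python
# Schematic generate-verify-filter/refine curation pipeline. The
# `generate`, `verify`, and `refine` callables are illustrative
# placeholders, not from any specific system.

def curate(seeds, generate, verify, refine, max_tries=2):
    """Generate candidate tasks, verify them, and refine or drop failures."""
    kept = []
    for seed in seeds:
        candidate = generate(seed)
        for _ in range(max_tries):
            if verify(candidate):
                kept.append(candidate)
                break
            candidate = refine(candidate)  # repair, then re-verify
    return kept
```

Candidates that still fail verification after `max_tries` refinements are silently dropped, which is the "filter" half of the final stage.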

Data lifecycle in Agentic RAG

Data Lifecycle

Overview

  1. Data Collecting
  2. Data Preprocessing and Task Formulation
  3. Task Construction: Annotation and Synthesis
  4. Data for Evaluation
  5. Data Enhancement for Training

Data Collecting

Static Data

  • Wikipedia
    • (TACL 2019) Natural Questions: A Benchmark for Question Answering Research [Paper] [Code] GitHub Repo stars
    • (EMNLP 2018) HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering [Paper] [Code] GitHub Repo stars
    • (COLING 2020) Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps (2WikiMultihopQA) [Paper] [Code] GitHub Repo stars
  • GitHub repositories
    • (ICLR 2024) SWE-bench: Can Language Models Resolve Real-world Github Issues? [Paper] [Code] GitHub Repo stars
  • Kaggle competitions
    • (ICLR 2025) MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering [Paper] [Code] GitHub Repo stars

Interactive Data

  • API-based retrieval
    • (WWW 2025) FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research [Paper] [Code] GitHub Repo stars
  • Web navigation
    • WebGPT: Browser-assisted question-answering with human feedback [Paper]
    • WebDancer: Towards Autonomous Information Seeking Agency [Paper] [Code] GitHub Repo stars

Data Preprocessing and Task Formulation

Preprocessing

  • (EMNLP 2025) LightRAG: Simple and Fast Retrieval-Augmented Generation [Paper] [Code] GitHub Repo stars (relation schemas)

  • T-GRAG: A Dynamic GraphRAG Framework for Resolving Temporal Conflicts and Redundancy in Knowledge Retrieval [Paper] [Code] GitHub Repo stars (chronological structure)

Task Formulation

  • Closed-ended

    • (ACL 2017) TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension [Paper] [Code] GitHub Repo stars
    • (TACL 2019) Natural Questions: A Benchmark for Question Answering Research [Paper] [Code] GitHub Repo stars
  • Real-world workflows

    • (ICLR 2024) SWE-bench: Can Language Models Resolve Real-world Github Issues? [Paper] [Code] GitHub Repo stars
    • (ICLR 2025) MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering [Paper] [Code] GitHub Repo stars
  • Creative (Academic Writing)

    • (NeurIPS 2024) AutoSurvey: Large Language Models Can Automatically Write Surveys [Paper] [Code] GitHub Repo stars
    • (ACL 2025) SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing [Paper] [Code] GitHub Repo stars
    • SurveyX: Academic Survey Automation via Large Language Models [Paper] [Code] GitHub Repo stars
    • Agent Laboratory: Using LLM Agents as Research Assistants [Paper] [Code] GitHub Repo stars

Task Construction: Annotation and Synthesis

Generate

Curating Methods
  • Crowdsourced

    • (TACL 2019) Natural Questions: A Benchmark for Question Answering Research [Paper] [Code] GitHub Repo stars
    • (EMNLP 2018) HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering [Paper] [Code] GitHub Repo stars
    • Measuring short-form factuality in large language models (SimpleQA) [Paper] [Code] GitHub Repo stars
    • BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [Paper] [Code] GitHub Repo stars
    • (ICLR 2024) GAIA: a benchmark for General AI Assistants [Paper] [Dataset]
  • Ready-made tasks from the Internet

    • (ACL 2017) TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension [Paper] [Code] GitHub Repo stars
    • (ICLR 2024) SWE-bench: Can Language Models Resolve Real-world Github Issues? [Paper] [Code] GitHub Repo stars
    • (ICLR 2025) MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering [Paper] [Code] GitHub Repo stars
  • Synthetic

    • (COLING 2020) Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps (2WikiMultihopQA) [Paper] [Code] GitHub Repo stars
    • (ACL 2024) INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning [Paper] [Code] GitHub Repo stars
    • (NeurIPS 2024) Gorilla: Large Language Model Connected with Massive APIs [Paper] [Code] GitHub Repo stars
    • WebDancer: Towards Autonomous Information Seeking Agency [Paper] [Code] GitHub Repo stars
Difficulty Enhancement
  • Complexity

    • (EMNLP 2018) HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering [Paper] [Code] GitHub Repo stars (multi-hop)
    • (COLING 2020) Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps (2WikiMultihopQA) [Paper] [Code] GitHub Repo stars (multi-hop)
    • (TACL 2022) MuSiQue: Multihop Questions via Single-hop Question Composition [Paper] [Code] GitHub Repo stars (multi-hop)
    • TaskCraft: Automated Generation of Agentic Tasks [Paper] [Code] GitHub Repo stars (multi-hop)
    • WebDancer: Towards Autonomous Information Seeking Agency [Paper] [Code] GitHub Repo stars (multi-hop)
    • (ACL 2024) On the Multi-turn Instruction Following for Conversational Web Agents [Paper] [Code] GitHub Repo stars (multi-turn conversations)
    • (ACL 2025) WebWalker: Benchmarking LLMs in Web Traversal [Paper] [Code] GitHub Repo stars (multiple webpages)
    • (ICLR 2024) SWE-bench: Can Language Models Resolve Real-world Github Issues? [Paper] [Code] GitHub Repo stars (repo-level coding)
    • (ICLR 2024) RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems [Paper] [Code] GitHub Repo stars (repo-level coding)
    • (ICLR 2024) GAIA: a benchmark for General AI Assistants [Paper] [Dataset] (multiple tools)
  • Uncertainty

    • (TACL 2021) Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies (StrategyQA) [Paper] [Code] GitHub Repo stars (implicit reasoning tasks)
    • (COLING 2020) Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps (2WikiMultihopQA) [Paper] [Code] GitHub Repo stars (distractors in reference documents)
    • (TACL 2022) MuSiQue: Multihop Questions via Single-hop Question Composition [Paper] [Code] GitHub Repo stars (distractors in reference documents, unanswerable questions)
    • WebSailor: Navigating Super-human Reasoning for Web Agent [Paper] [Code] GitHub Repo stars (obfuscate key information)
    • BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [Paper] [Code] GitHub Repo stars (inverted problems)
  • Expertise
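
As one concrete instance of complexity enhancement, the MuSiQue-style recipe composes 2-hop questions from single-hop pairs and drops compositions with obvious shortcuts. The sketch below is a loose illustration of that idea, not MuSiQue's actual implementation:

```python
# Illustrative 2-hop composition: chain two single-hop questions when
# the first one's answer appears in the second question, then drop
# compositions whose final answer leaks into the merged question text.

def compose_two_hop(single_hops):
    """single_hops: list of (question, answer) pairs."""
    composed = []
    for q1, a1 in single_hops:
        for q2, a2 in single_hops:
            if (q1, a1) == (q2, a2) or a1 not in q2:
                continue
            # Substitute the bridging entity with a reference to hop 1.
            merged = q2.replace(a1, f"the answer to '{q1}'")
            if a2.lower() in merged.lower():
                continue  # shortcut: the answer is guessable from the question
            composed.append((merged, a2))
    return composed
```

Real pipelines add further filters (e.g., connected-reasoning checks) on top of this basic chaining step.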

Verify

Methods
  • Human-based (inter-annotator agreement)

    • Measuring short-form factuality in large language models (SimpleQA) [Paper] [Code] GitHub Repo stars
    • BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [Paper] [Code] GitHub Repo stars
    • (ICLR 2024) GAIA: a benchmark for General AI Assistants [Paper] [Dataset]
  • LLM-based

    • (ACL 2024 Findings) Chain-of-Verification Reduces Hallucination in Large Language Models [Paper] [Code] GitHub Repo stars
Overlooked Validity Criteria
  • QA

    • Measuring short-form factuality in large language models (SimpleQA) [Paper] [Code] GitHub Repo stars (unique, time-invariant answer)
    • (ICLR 2024) GAIA: a benchmark for General AI Assistants [Paper] [Dataset] (unique, time-invariant answer)
  • Code

    • (ICLR 2024) SWE-bench: Can Language Models Resolve Real-world Github Issues? [Paper] [Code] GitHub Repo stars (environment reproducible, reference code passable)

Filter/Refine

Quality
  • (ACL 2025) WebWalker: Benchmarking LLMs in Web Traversal [Paper] [Code] GitHub Repo stars (linguistic naturalness)

  • A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains (Amazon-bench) [Paper] (linguistic naturalness)

  • (EMNLP 2020) Is Multihop QA in DIRE Condition? Measuring and Reducing Disconnected Reasoning [Paper] [Code] GitHub Repo stars (no data leakage or exploitable shortcuts)

  • (TACL 2022) MuSiQue: Multihop Questions via Single-hop Question Composition [Paper] [Code] GitHub Repo stars (no data leakage or exploitable shortcuts)

  • Agent Laboratory: Using LLM Agents as Research Assistants [Paper] [Code] GitHub Repo stars (source credibility)

Difficulty
  • Rule-based

    • TaskCraft: Automated Generation of Agentic Tasks [Paper] [Code] GitHub Repo stars (number of hops)
    • (ACL 2025) WebWalker: Benchmarking LLMs in Web Traversal [Paper] [Code] GitHub Repo stars (number of hops)
    • (ICLR 2024) GAIA: a benchmark for General AI Assistants [Paper] [Dataset] (number of tools)
    • (TACL 2022) MuSiQue: Multihop Questions via Single-hop Question Composition [Paper] [Code] GitHub Repo stars (with or without unanswerable questions)
    • (COLM 2024) GPQA: A Graduate-Level Google-Proof Q&A Benchmark [Paper] [Code] GitHub Repo stars (accuracy of experts and non-experts)
  • LLM-based (LLM's success rate as proxy)

    • (NeurIPS 2024) Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization [Paper] [Code] GitHub Repo stars
    • TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models [Paper]
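
The LLM-success-rate proxy amounts to sampling a model several times per task and using the empirical failure rate as a difficulty label. A minimal sketch (with `ask_model` standing in for an LLM call):

```python
# Illustrative difficulty scoring via a model's empirical failure rate.
# `ask_model` is a stand-in for querying an LLM; real setups would also
# normalize answers before comparison.

def difficulty(task, reference, ask_model, k=8):
    """Return the failure rate in [0, 1]; higher means harder for this model."""
    successes = sum(ask_model(task) == reference for _ in range(k))
    return 1.0 - successes / k
```

SimpleQA's curation rule ("incorrectly answered at least once in 4 attempts") is a thresholded version of the same idea.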

Data for Evaluation

Decontamination

  • (TACL 2022) MuSiQue: Multihop Questions via Single-hop Question Composition [Paper] [Code] GitHub Repo stars (filter out multi-hop questions in test split with any identical single-hop component in train split)

  • (ICLR 2024) GAIA: a benchmark for General AI Assistants [Paper] [Dataset] (question does not exist on the internet in plain text)
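
The MuSiQue-style decontamination rule above can be sketched as a set-membership filter over single-hop component identifiers (the record layout here is illustrative):

```python
# Illustrative decontamination: drop any test question that shares a
# single-hop component with the training split.

def decontaminate(test_items, train_items):
    """Each item: (question, component_ids), where component_ids identify
    the single-hop sub-questions the item was composed from."""
    train_components = {c for _, comps in train_items for c in comps}
    return [
        (q, comps) for q, comps in test_items
        if not any(c in train_components for c in comps)
    ]
```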

Evaluation Metrics and Approaches

Correctness

For this part, please refer to the Task Formulation section above for the corresponding papers.

  • Gold-standard answers

  • Programmatic validators

  • LLM-as-a-judge
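
For reference, the two most common gold-answer metrics, exact match after light normalization and token-level F1, can be implemented as follows (the normalization shown follows the usual SQuAD-style convention; details vary across benchmarks):

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop articles
    text = re.sub(r"[^\w\s]", "", text)           # drop punctuation
    return " ".join(text.split())                 # collapse whitespace

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```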

Beyond Correctness
  • (ACL 2025) WebWalker: Benchmarking LLMs in Web Traversal [Paper] [Code] GitHub Repo stars (efficiency: the action count of successful agentic executions)

  • A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains (Amazon-bench) [Paper] (safety: benign failures vs. harmful failures)

Data Enhancement for Training

SFT

Basic Tool-usage Skills
  • (NeurIPS 2023) Toolformer: Language Models Can Teach Themselves to Use Tools [Paper] [Code] GitHub Repo stars (modify pretraining corpora)
  • (ACL 2024) INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning [Paper] [Code] GitHub Repo stars (integrate multiple resources into meta-datasets)
  • (NeurIPS 2024) Gorilla: Large Language Model Connected with Massive APIs [Paper] [Code] GitHub Repo stars (self-instruction and in-context learning)
Thought–action Trajectories
  • Generate

    • (NeurIPS 2022) STaR: Bootstrapping Reasoning With Reasoning [Paper] [Code] GitHub Repo stars (in-context bootstrapping)
    • Distilling LLM Agent into Small Models with Retrieval and Code Tools [Paper] [Code] GitHub Repo stars (trajectory distillation)
    • WebSailor: Navigating Super-human Reasoning for Web Agent [Paper] [Code] GitHub Repo stars (trajectory distillation)
    • WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization [Paper] [Code] GitHub Repo stars (trajectory distillation)
  • Filter/Refine

    • (ACL 2025 Findings) Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning [Paper] [Code] GitHub Repo stars (quality influenced by factors such as trajectory granularity, formatting choices, and the teacher model used)
    • WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization [Paper] [Code] GitHub Repo stars (conciseness: filters out trajectories with severe repetition)
    • WebSailor: Navigating Super-human Reasoning for Web Agent [Paper] [Code] GitHub Repo stars (conciseness: reconstructs concise rationales from action–observation sequences)
    • Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation [Paper] (conciseness: removes redundant or incorrect reasoning paths)
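
A simple instance of the conciseness filtering described above is to drop trajectories whose steps repeat near-verbatim. The threshold and normalization below are illustrative choices, not those of any cited system:

```python
# Illustrative repetition filter for thought-action trajectories: a
# trajectory is discarded if any normalized step occurs too many times.

def has_severe_repetition(steps: list[str], max_repeats: int = 2) -> bool:
    counts: dict[str, int] = {}
    for step in steps:
        key = " ".join(step.lower().split())  # normalize case and whitespace
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False

def filter_trajectories(trajectories):
    return [t for t in trajectories if not has_severe_repetition(t)]
```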

RL

Outcome-based Rewards
  • (COLM 2025) Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [Paper] [Code] GitHub Repo stars
  • R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [Paper] [Code] GitHub Repo stars
  • DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments [Paper] [Code] GitHub Repo stars
  • ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning [Paper] [Code] GitHub Repo stars
  • WebDancer: Towards Autonomous Information Seeking Agency [Paper] [Code] GitHub Repo stars
  • WebSailor: Navigating Super-human Reasoning for Web Agent [Paper] [Code] GitHub Repo stars
  • WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization [Paper] [Code] GitHub Repo stars
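
These systems share the pattern of rewarding only the final outcome: intermediate search steps receive no direct reward, and the answer is scored against gold, often alongside a separate format term. A hedged sketch of that pattern, not any specific system's reward function:

```python
# Illustrative outcome-based reward: score the final answer by exact
# match (after trivial normalization) and add a small format bonus.
# The 0.1 bonus is an arbitrary illustrative value; actual shaping
# varies between systems.

def outcome_reward(final_answer: str, gold: str, well_formatted: bool = True) -> float:
    reward = 1.0 if final_answer.strip().lower() == gold.strip().lower() else 0.0
    if well_formatted:
        reward += 0.1
    return reward
```
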
Data-aware Rewards
  • (COLM 2025) DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning [Paper] [Code] GitHub Repo stars (retrieval rewards)
  • ReZero: Enhancing LLM search ability by trying one-more-time [Paper] [Code] GitHub Repo stars (retrieval rewards)
  • (NeurIPS 2025) WebThinker: Empowering Large Reasoning Models with Deep Research Capability [Paper] [Code] GitHub Repo stars (preference pairs based on quality, efficiency, and conciseness)

Domain-Specific Agentic RAG Benchmarks

| Name | Task | Source | Scale | Metrics | Data Curating Method |
| --- | --- | --- | --- | --- | --- |
| **Question Answering (QA)** | | | | | |
| NQ | Single-hop QA | Google queries, Wikipedia | train 307k, dev 7.8k, test 7.8k | – | Select queries from Google. Search for relevant documents in Wikipedia, and ask annotators to identify answers and filter low-quality questions. |
| TriviaQA | Single-hop QA | Quiz websites, Wikipedia and the Internet | train 76.5k, val 10.0k, test 9.5k | – | Select questions from 14 quiz websites. Search for relevant documents in Wikipedia and the Internet, and keep those with answers. |
| SimpleQA | Single-hop QA | Crowdsourced | 4,326 | – | Annotators create questions with a unique, time-invariant answer. All questions are independently verified by a second person. Keep only those answered incorrectly at least once in 4 attempts by GPT-4. |
| HotpotQA | Multi-hop QA | Crowdsourced from Wikipedia | train 90.4k, val 7.4k, test 7.4k | – | Build a relation graph from the links in Wikipedia. Choose relevant paragraphs from it, and ask annotators to create multi-hop questions based on the paragraphs and identify supporting facts in them. |
| 2WikiMultihopQA | Multi-hop QA | Synthesized from Wikipedia | train medium 155k, train hard 12.6k, dev 12.6k, test 12.6k | – | Classify the entities in Wikidata. Manually write different question templates, and sample entities to create questions. Filter out questions with no answer or multiple answers. Add distractors to supporting documents. |
| MuSiQue | Multi-hop QA | Synthesized and annotated from Wikipedia | train 39.9k, val 4.8k, test 4.9k | – | Collect Wikipedia-based single-hop questions. Compose 2-hop questions and filter out those with shortcuts. Build different multi-hop question structures and crowdsource questions. Add distractors to supporting documents. Add unanswerable questions. |
| Bamboogle | Multi-hop QA | Manually created from Wikipedia | 125 | – | Create 2-hop questions based on Wikipedia. Keep only those whose correct answer cannot be found by a direct search. |
| TaskCraft | Multi-hop QA | Synthesized from different corpora | 36k | – | Generate single-hop questions from different corpora with an LLM. Extend to multi-hop questions via depth-based and width-based extension. Filter out those with shortcuts. |
| **Web** | | | | | |
| WebArena | QA-like & task-oriented web interaction | Custom web environments (shopping, email, forum, map, social media) | 7 environments, 812 tasks | Task success rate | Provide realistic multi-page websites. Annotators design diverse tasks requiring navigation, reasoning, and interaction. |
| AgentBench | Open-ended web tasks with tool use | Real-world web APIs and websites | 8 domains, 2,000+ tasks | Success rate, human eval | Collect tasks from multiple domains (travel, shopping, QA, etc.). Provide tool APIs and human-verified success criteria. |
| GAIA | Complex open-domain information-seeking | Live web environment | 466 tasks (300 retained answers) | F1 score, factual accuracy | Ask annotators to design multi-step questions requiring reasoning, planning, and external search. Include hidden evaluation sets to test real-time retrieval. |
| BrowseComp | Fact-seeking QA over web browsing | Internet (open web), human-crafted QA | 1,266 questions | Exact match | Questions are designed so the answer is short and verifiable. Human annotators ensure difficulty (not solved by existing models, not in top search results) and enforce time/effort thresholds. |
| WebWalkerQA | Multi-hop QA via web navigation | Real Wikipedia + open web | 680 questions | Exact match, F1 score | Generate multi-hop QA pairs requiring active web navigation. Filter with LLM-based difficulty control and human verification. |
| Amazon-Bench | E-commerce | Live Amazon.com webpages | 400 user queries across 7 task types | Task success rate, harmful/benign failure rate, efficiency | Explore and categorize 60k+ Amazon pages. Sample diverse pages by functionality score, then prompt LLMs to generate realistic user queries and refine them to sound more natural and user-like. |
| **Software Engineering** | | | | | |
| SWE-bench | Generate a pull request (PR) to solve a given issue | GitHub issues from 12 Python repositories | train 19k, test 2,294 | Unit test pass rate | Select PRs that resolve an issue and contribute tests. Keep only those that install successfully and pass all tests. |
| RepoBench | Code retrieval, code completion | Github-code dataset, GitHub Python and Java repositories | Python 24k, Java 26k | Golden snippet matching, line matching | Randomly sample lines as completion goals (with a first-to-use subset). Extract candidate snippets based on import statements, and annotate golden snippets. |
| DevEval | Repository-level function completion | Popular repositories from PyPI | 1,874 | Unit test pass rate, recall of reference dependencies | Select functions with test cases from repositories. Ask annotators to write requirements and reference dependencies. Filter out those with no cross-file dependency. |
| **Machine Learning** | | | | | |
| MLAgentBench | Improve the performance metric by at least 10% over the baseline in the starter code | Kaggle | 13 | Success rate of 10% improvement, total time and tokens | Manually construct task descriptions, starter code, and evaluation code. |
| MLE-bench | Achieve the best score on a metric pre-defined for each competition | Kaggle | 75 | Test score compared on leaderboard (e.g., medals) | Crawl task descriptions, datasets, grading code, and leaderboards from the Kaggle website. Keep only those that are reproducible and up-to-date. Manually label the category and difficulty. |
| **Medical** | | | | | |
| MedQA | Four-option multiple-choice questions | National Medical Board Examination | train 48.9k, dev 6.1k, test 8.1k | Exact match | Collect question–answer pairs from the National Medical Board Examination. |
| MedMCQA | Four-option multiple-choice QA resembling medical exams | Open websites and books, All India Institute of Medical Sciences, National Eligibility cum Entrance Test | train 18.2k, dev 4.2k, test 6.2k | Exact match | Collect question–answer pairs from medical examinations. Use rule-based methods to preprocess the data. Split the dataset by exam (the training set consists of questions from mock and online exams, while the dev and test sets consist of questions from formal exams). |
| Quilt-VQA | VQA (visual question answering) | Educational histopathology videos on YouTube | Image-dependent: 1,055; general-knowledge: 255 | LLM evaluation | Localize the "?"s in each video's transcript. Extract the relevant texts and images. Prompt GPT-4 to generate QA pairs. Perform manual verification. |
| PathVQA | VQA | Electronic pathology textbooks and the Pathology Education Informational Resource Digital Library website | Images: 4,998; QA pairs: 32,799 | Accuracy (yes/no questions), exact match, macro-averaged F1, BLEU | Extract images and their captions from the data sources. Process the captions to break long sentences into short ones and obtain POS tagging. Generate open-ended questions based on POS tags and named entities. |
| PMC-VQA | VQA | PMC-OA | Images: 149k; QA pairs: 227k | BLEU, accuracy | Prompt ChatGPT with the images and captions to generate QA pairs. Perform LLM-based and manual data filtering. |
| PathMMU | VQA | PubMed, EduContent, Atlas, SocialPath, PathCLS | Images: train 16,312, val 510, test 7,213; QA pairs: train 23,041, val 710, test 9,677 | – | Extract image–caption pairs from the data sources. Prompt GPT-4V to generate detailed image descriptions and then three questions per image. Perform expert validation. |
| **Legal** | | | | | |
| LegalBench | Issue-spotting, rule-recall, rule-application and rule-conclusion, interpretation, rhetorical-understanding | Existing datasets, in-house datasets | 9.1k | Accuracy, human evaluation | Filter and restructure the data from the data sources. |
| LegalBench-RAG | Retrieve snippets from legal corpora | LegalBench, PrivacyQA, CUAD, MAUD, ContractNLI | 6,889 | Recall@k, precision@k | Start from LegalBench queries. Trace each query's context back to its original document span in the corpus. The final dataset pairs each query with its exact evidence. |

Metrics for QA are generally string matching (exact/fuzzy) or F1, and are omitted in the table.

Related Surveys

  • Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG [Paper] [GitHub] GitHub Repo stars (a general survey on Agentic RAG pipelines and frameworks)
  • (EMNLP 2025) Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [Paper] [GitHub] GitHub Repo stars (the reasoning methods and frameworks in Agentic RAG)

Contributing

We welcome contributions to expand this collection! To add your work, please:

  1. Submit a Pull Request or Open an Issue with the following information:

    • Paper Title: Your paper's full title
    • Paper Link: DOI, arXiv, or conference link
    • GitHub Repository: Link to your open-source implementation (if available)
    • Category: Specify which stage under our lifecycle your work belongs to:
      • Data Collecting: Static Data / Interactive Data
      • Data Preprocessing and Task Formulation: Preprocessing / Task Formulation
      • Task Construction: Annotation and Synthesis: Generate / Verify / Filter
      • Data for Evaluation: Decontamination / Evaluation Metrics and Approaches
      • Data Enhancement for Training: SFT / RL

    Notice that your work may belong to multiple stages. Please choose 1–3 main focuses for your work.

  2. Format: Follow the existing format in the README for consistency.

  3. Relevance: Ensure your work is relevant to Agentic RAG data.

Your contributions help build a comprehensive resource for the research community!
