SJTU-DMTai/Data-Agent-Reading-List

# Awesome Data Agent and Agent Memory Papers

## Benchmarks

### Data Agent Benchmarks

| Benchmark Name | Corresp. Author | Year | Source | TLDR |
| --- | --- | --- | --- | --- |
| InfiAgent-DABench | Jingjing Xu | 2024 | Link, Repo | Benchmark from the ICML'24 paper for evaluating agents on closed-form data analysis tasks. Sub-tasks: Data querying, processing, and reasoning over CSV files using Python. Input: Natural language questions paired with CSV files. Output: Closed-form answers (e.g., specific values or strings) matched against canonical labels. |
| FDABench | Gao Cong | 2025 | Link, Repo | Benchmark for agents in multi-source analytical scenarios. Sub-tasks: Single-source structured analysis, unstructured data retrieval, and cross-source heterogeneous data fusion. Input: Analytical queries spanning structured (DB) and unstructured (PDF/audio) sources. Output: SQL queries, precise values, or comprehensive analytical reports. Metrics: Exact match for MC/checkbox tasks; text tasks scored by overlap metrics (e.g., ROUGE) plus tool-use success/recall; also reports efficiency (latency, model/tool calls, tokens/cost). |
| DSBench | Dong Yu | 2025 | Link, Repo | Benchmark from the ICLR'25 paper. Realistic benchmark evaluating data science agents on complex tasks. Sub-tasks: Data cleaning, transformation, visualization, and modeling using libraries like Pandas/Scikit-learn. Input: Long-horizon problem descriptions with access to live data stacks. Output: Executable code and final artifacts (charts, tables, or answers). Metrics: Per-task 0–5 rubric with penalties for extra user interaction, corrections, or hallucinations; reports overall score and hallucination rate. |
| DABStep | Thomas Wolf | 2025 | Link, Repo | Benchmark focusing on iterative multi-step reasoning in financial analytics. Sub-tasks: Code-based data manipulation and cross-referencing structured tables with unstructured documentation. Input: Complex queries requiring multi-step navigation of heterogeneous data. Output: Factoid-style answers (string/numeric) verifiable by automatic scoring. |
| DataSciBench | Yisong Yue | 2025 | Link, Repo | Comprehensive benchmark that evaluates tasks with uncertain ground truth using the Task-Function-Code framework. Sub-tasks: Data cleaning, exploration, visualization, predictive modeling, and report generation. Input: Natural language prompts accompanied by datasets. Output: Executable code and valid final answers derived from the data. |
| DA-Code | Kang Liu | 2024 | Link, Repo | Challenging benchmark for agentic code generation in data science. Sub-tasks: Complex data wrangling and analytics via code generation. Input: Natural language descriptions of domain-specific problems grounded in the input data. |
| DSEval | Kan Ren | 2024 | Link, Repo | Evaluation paradigm covering the full data science lifecycle via bootstrapped annotation. Sub-tasks: Problem definition, data processing, and modeling across datasets from sources like Kaggle and LeetCode. Input: Sequences of interdependent data science problems. Output: Code solutions and execution results for each iteration. |
| WebDS | Christopher D. Manning | 2025 | Link | End-to-end web-based benchmark reflecting real-world analytics workflows. Sub-tasks: Autonomous web browsing, data acquisition, cleaning, analysis, and visualization. Input: High-level analytical goals with access to 29 diverse websites. Output: Summarized analyses and insights. |
| PredictiQ | Xiaojun Ma | 2025 | Link, Repo | Benchmark specialized in predictive analysis across diverse fields. Sub-tasks: Text analysis and code generation for prediction tasks. Input: Sophisticated predictive queries paired with real-world datasets. Output: Prediction results with verified text-code alignment. |
| InsightBench | Issam Hadj Laradji | 2025 | Paper, Repo | Benchmark for end-to-end business analytics and insight discovery. Sub-tasks: Formulating questions, interpreting answers, and summarizing findings. Input: Business datasets with high-level analytic goals. Output: A summary of discovered insights and actionable steps. |

### Latest Benchmarks for Tabular Data (Since 2024)

| Benchmark Name | Task Type | Year | Source | TLDR |
| --- | --- | --- | --- | --- |
| MDBench | Reasoning | 2025 | Link | MDBench introduces a multi-document reasoning benchmark synthetically generated through knowledge-guided prompting. |
| MMQA | Reasoning | 2025 | Link, Repo | MMQA is a multi-table, multi-hop question answering dataset with 3,312 tables across 138 domains, evaluating LLMs' capabilities in multi-table retrieval, Text-to-SQL, Table QA, and primary/foreign key selection. |
| ToRR | Reasoning | 2025 | Link, Repo | ToRR assesses LLMs' table reasoning and robustness across 10 datasets with diverse table serializations and perturbations, revealing models' brittleness to format variations. |
| MMTU | Comprehensive | 2025 | Link, Repo | MMTU is a massive multi-task table understanding and reasoning benchmark with over 30K questions across 25 real-world table tasks, designed to evaluate models' ability to understand, reason over, and manipulate tables. |
| RADAR | Reasoning | 2025 | Link, Repo | RADAR evaluates language models' data-aware reasoning on imperfect tabular data with 5 common data artifact types (e.g., outlier values, inconsistent formats); the perturbations ensure that direct calculation on the perturbed table yields an incorrect answer, forcing the model to handle the artifacts to obtain the correct result. |
| Spider2 | Text2SQL | 2025 | Link, Repo | Evaluation framework for real-world enterprise text-to-SQL workflows. Sub-tasks: Interacting with complex SQL environments (BigQuery, Snowflake), handling diverse operations, and processing long contexts. Input: Natural language questions with enterprise-level database schemas. Output: Complex SQL queries (often >100 lines) to solve the workflow. |
| DataBench | Reasoning | 2024 | Link, Repo | Benchmark for question answering over tabular data assessing semantic reasoning. Sub-tasks: Answering questions requiring numerical, boolean, or categorical reasoning over diverse datasets. Input: Natural language questions paired with 65 real-world tabular datasets (CSV/Parquet). Output: Exact answer values (Boolean, Number, Category, or List) derived from the table. |
| TableBench | Reasoning | 2024 | Link, Repo | Comprehensive Table QA benchmark covering 18 fields and 4 complexity categories. Sub-tasks: Fact-checking, numerical reasoning, data analysis, and visualization. Input: Natural language questions paired with tables emphasizing numerical data. Output: Final answers derived through complex reasoning steps (e.g., Chain-of-Thought). |
| TQA-Bench | Reasoning | 2024 | Link, Repo | Benchmark for evaluating LLMs on multi-table question answering with scalable context. Sub-tasks: Reasoning across multiple interconnected tables and handling long-context serialization (up to 64k tokens). Input: Natural language questions paired with multi-table relational databases. Output: Answers or SQL queries derived from joining and analyzing multiple tables. |
| SpreadsheetBench | Reasoning | 2024 | Link, Repo | Benchmark for spreadsheet manipulation derived exclusively from real-world scenarios. Sub-tasks: Find, extract, sum, highlight, remove, modify, count, delete, calculate, and display. Input: Real-world user instructions from Excel forums paired with complex spreadsheet files. Output: Modified spreadsheet files or specific values matching the instruction. |

### Selected Classical Reasoning Benchmarks for Tabular Data

| Benchmark Name | Task Type | Year | Source |
| --- | --- | --- | --- |
| FinQA | Reasoning | 2021 | Link, Repo |
| FeTaQA | Reasoning | 2021 | Link, Repo |
| HiTab | Reasoning | 2022 | Link, Repo |

## Conference Papers (including arXiv)

| Venue | Paper | Authors | Links |
| --- | --- | --- | --- |
| SIGMOD'26 | ST-Raptor: LLM-Powered Semi-Structured Table Question Answering | Xuanhe Zhou | paper |
| arXiv/2601 | Can We Predict Before Executing Machine Learning Agents? | Ningyu Zhang | paper |
| CIDR'25 | AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries | Guoliang Li | paper |
| CIDR'25 | Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing | Gerardo Vitagliano | paper |
| IEEE Data Eng. Bull. | iDataLake: An LLM-Powered Analytics System on Data Lakes | Guoliang Li | paper |
| VLDB'25 | Towards Automated Cross-domain Exploratory Data Analysis through Large Language Models | Qi Liu | paper |
| VLDB'25 | AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework | Ju Fan | paper |
| VLDB'25 | DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing | Eugene Wu | paper |
| VLDB'25 | Data Imputation with Limited Data Redundancy Using Data Lakes | Yuyu Luo | paper |
| VLDB'25 | Semantic Operators and Their Optimization: Enabling LLM-Based Data Processing with Accuracy Guarantees in LOTUS | Matei Zaharia | paper |
| VLDB'25 | Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation | Yuyu Luo, Ju Fan, Nan Tang, et al. | paper |
| SIGMOD'25 | Automatic Database Configuration Debugging using Retrieval-Augmented Language Models | Nan Tang, Ju Fan, et al. | paper |
| SIGMOD'25 | Andromeda: Debugging Database Performance Issues with Retrieval-Augmented Large Language Models | Nan Tang, Ju Fan, Xiaoyong Du, et al. | paper |
| ICDE'25 | DataLab: A Unified Platform for LLM-Powered Business Intelligence | Wei Chen | paper |
| ICML'25 | AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML | Sung Ju Hwang | paper |
| ICML'25 | Compositional Condition Question Answering in Tabular Understanding | Han-Jia Ye | paper |
| ICML'25 | Are Large Language Models Ready for Multi-Turn Tabular Data Analysis? | Reynold Cheng | paper |
| ICML'25 | Agent Workflow Memory | Graham Neubig | paper |
| ICLR'25 | SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents | Qianhui Wu | paper |
| ICLR'25 | InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation | Issam Hadj Laradji | paper |
| ICLR'25 | Agent-Oriented Planning in Multi-Agent Systems | Yaliang Li | paper |
| ICLR'25 Oral | AFlow: Automating Agentic Workflow Generation | Chenglin Wu | paper |
| COLM'25 | Inducing Programmatic Skills for Agentic Tasks | Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, Daniel Fried | paper |
| NAACL'25 | H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables | Chandan K. Reddy | paper |
| ACL'25 | Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning | Peiying Yu, Jingjing Wang | paper |
| ACL'25 Findings | Data Interpreter: An LLM Agent For Data Science | Bang Liu & Chenglin Wu | paper |
| NeurIPS'25 | Table as a Modality for Large Language Models | Junbo Zhao | paper |
| arXiv/2512 | Beyond Sliding Windows: Learning to Manage Memory in Non-Markovian Environments | Tim Klinger | paper |
| arXiv/2512 | MemR3: Memory Retrieval via Reflective Reasoning for LLM Agents | Song Le | paper |
| arXiv/2512 | Learning Hierarchical Procedural Memory for LLM Agents through Bayesian Selection and Contrastive Refinement | Mahdi Jalili | paper |
| arXiv/2512 | Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution | Zouying Cao, Zhaoyang Liu, Bolin Ding, Hai Zhao | paper |
| arXiv/2510 | AgentFold: Long-Horizon Web Agents with Proactive Context Management | Rui Ye, Siheng Chen | paper |
| arXiv/2510 | Scaling Long-Horizon LLM Agent via Context-Folding | Weiwei Sun | paper |
| arXiv/2510 | LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation | Dongge Han | paper |
| arXiv/2510 | TOOLMEM: Enhancing Multimodal Agents with Learnable Tool Capability Memory | Zora Zhiruo Wang | paper |
| arXiv/2510 | Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks | Jitao Sang | paper |
| arXiv/2510 | LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science | Tomas Pfister | paper |
| arXiv/2510 | DeepAnalyze: Agentic Large Language Models for Autonomous Data Science | Ju Fan | paper |
| arXiv/2510 | Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models | Qizheng Zhang & Changran Hu | paper |
| arXiv/2509 | Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences | Andrew Kyle Lampinen | paper |
| arXiv/2509 | H2R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents | Chengdong Xu | paper |
| arXiv/2509 | Mem-α: Learning Memory Construction via Reinforcement Learning | Yu Wang | paper |
| arXiv/2509 | ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory | Siru Ouyang | paper |
| arXiv/2509 | SGMem: Sentence Graph Memory for Long-Term Conversational Agents | Yaxiong Wu | paper |
| arXiv/2509 | TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning | Qi Liu | paper |
| arXiv/2509 | MemGen: Weaving Generative Latent Memory for Self-Evolving Agents | Shuicheng Yan | paper |
| arXiv/2508 | Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models | Zhouhan Lin | paper |
| arXiv/2508 | Memory-R1: Enhancing Large Language Model Agents to Actively Manage and Utilize External Memory | Yunpu Ma | paper |
| arXiv/2508 | Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems | Guoliang Li | paper |
| arXiv/2508 | Memp: Exploring Agent Procedural Memory | Ningyu Zhang | paper |
| arXiv/2508 | Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory | Wei Li | paper |
| arXiv/2508 | AgenticData: An Agentic Data Analytics System for Heterogeneous Data | Yuan Li | paper |
| arXiv/2508 | Multiple Memory Systems for Enhancing the Long-term Memory of Agent | Bo Wang | paper |
| arXiv/2507 | H-MEM: Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents | Shaoning Zeng | paper |
| arXiv/2507 | MemOS: A Memory OS for AI System | Siheng Chen, Wentao Zhang, Zhi-Qin John Xu, Feiyu Xiong | paper |
| arXiv/2507 | MIRIX: Multi-Agent Memory System for LLM-Based Agents | Xi Chen | paper |
| arXiv/2507 | MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent | Hao Zhou | paper |
| arXiv/2506 | MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning | Thuy-Trang Vu | paper |
| arXiv/2506 | Memory OS of AI Agent | Ting Bai | paper |
| arXiv/2506 | G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems | Shuicheng Yan | paper |
| arXiv/2506 | AutoMind: Adaptive Knowledgeable Agent for Automated Data Science | Ningyu Zhang | paper |
| arXiv/2505 | DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation | Patrick Ng | paper |
| arXiv/2505 | TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes | Ju Fan | paper |
| arXiv/2505 | How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior | Zidi Xiong | paper |
| arXiv/2505 | Weaver: Interweaving SQL and LLM for Table Reasoning | Vivek Gupta | paper |
| arXiv/2504 | AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery | Issam H. Laradji | paper |
| arXiv/2504 | Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory | Deshraj Yadav | paper |
| arXiv/2503 | R3Mem: Bridging Memory Retention and Retrieval via Reversible Compression | Yun Zhu | paper |
| arXiv/2503 | DatawiseAgent: A Notebook-Centric LLM Agent Framework for Automated Data Science | Yu Huang | paper |
| arXiv/2503 | DAgent: A Relational Database-Driven Data Analysis Report Generation Agent | Yunjun Gao | paper |
| arXiv/2502 | A-MEM: Agentic Memory for LLM Agents | Yongfeng Zhang | paper |
| arXiv/2501 | TableMaster: A Recipe to Advance Table Understanding with Language Models | Hanbing Liu | paper |
| arXiv/2501 | ChartInsighter: An Approach for Mitigating Hallucination in Time-series Chart Summary Generation with A Benchmark Dataset | Fen Wang, Siming Chen | paper |
| arXiv/2410 | AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competition | Wenhao Huang & Ge Zhang | paper |
| arXiv/2407 | LAMBDA: A Large Model Based Data Agent | Yancheng Yuan & Jian Huang | paper |
| ICML'24 | DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning | Hechang Chen | paper |
| ICLR'24 | Synapse: Trajectory-as-exemplar prompting with memory for computer control | Bo An | paper |
| ICLR'24 | OpenTab: Advancing Large Language Models as Open-domain Table Reasoners | Jiani Zhang | paper |
| ICLR'24 | CABINET: Content Relevance based Noise Reduction for Table Question Answering | Balaji Krishnamurthy | paper |
| ICLR'24 | Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | Hannaneh Hajishirzi | paper |
| ICLR'24 | Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding | Tomas Pfister | paper |
| ICLR'24 | MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework | Chenglin Wu | paper |
| AAAI'24 | ExpeL: LLM Agents Are Experiential Learners | Gao Huang | paper |
| VLDB'24 | ReAcTable: Enhancing ReAct for Table Question Answering | Jignesh M. Patel | paper |
| VLDB'24 | AutoTQA: Towards Autonomous Tabular Question Answering through Multi-Agent Large Language Models | Qi Liu | paper |
| VLDB'24 | D-Bot: Database Diagnosis System using Large Language Models | Guoliang Li | paper |
| NeurIPS'23 | Augmenting language models with long-term memory | Furu Wei | paper |
| NeurIPS'23 | Reflexion: Language Agents with Verbal Reinforcement Learning | Shunyu Yao | paper |
| IEEE VIS'23 | What Exactly is an Insight? A Literature Review | Alvitta Ottley | paper |
| SIGIR'23 | Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning (Dater) | Yongbin Li | paper |
| arXiv/2310 | MemGPT: Towards LLMs as Operating Systems | Charles Packer | paper |

## Survey Papers

- A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
- Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
- A Survey on Large Language Model-based Agents for Statistics and Data Science
- Large Language Model-based Data Science Agent: A Survey
- LLM/Agent-as-Data-Analyst: A Survey

## About

Papers for LLM-based Data Agent.
