| Benchmark Name | Corresp. Author | Year | Source | TLDR |
|---|---|---|---|---|
| InfiAgent-DABench | Jingjing Xu | 2024 | Link Repo | ICML'24 benchmark for evaluating agents on closed-form data analysis tasks. Sub-tasks: Data querying, processing, and reasoning over CSV files using Python. Input: Natural language questions paired with CSV files. Output: Closed-form answers (e.g., specific values or strings) matched against canonical labels. |
| FDABench | Gao Cong | 2025 | Link Repo | Benchmark for agents in multi-source analytical scenarios. Sub-tasks: Single-source structured analysis, unstructured data retrieval, and cross-source heterogeneous data fusion. Input: Analytical queries spanning structured (DB) and unstructured (PDF/Audio) sources. Output: SQL queries, precise values, or comprehensive analytical reports. Metrics: Exact-match for MC/checkbox tasks; text tasks by overlap metrics (e.g., ROUGE) plus tool-use success/recall; also reports efficiency (latency, model/tool calls, token/cost). |
| DSBench | Dong Yu | 2025 | Link Repo | ICLR'25 benchmark evaluating data science agents on realistic, complex tasks. Sub-tasks: Data cleaning, transformation, visualization, and modeling using libraries like Pandas/Scikit-learn. Input: Long-horizon problem descriptions with access to live data stacks. Output: Executable code and final artifacts (charts, tables, or answers). Metrics: Per-task 0–5 rubric with penalties for extra user interaction, corrections, or hallucinations; reports overall score and hallucination rate. |
| DABStep | Thomas Wolf | 2025 | Link Repo | Benchmark focusing on iterative multi-step reasoning in financial analytics. Sub-tasks: Code-based data manipulation and cross-referencing structured tables with unstructured documentation. Input: Complex queries requiring multi-step navigation of heterogeneous data. Output: Factoid-style answers (string/numeric) verifiable by automatic scoring. |
| DataSciBench | Yisong Yue | 2025 | Link Repo | Comprehensive benchmark for evaluating data science agents under uncertain ground truth, built on the Task-Function-Code (TFC) framework. Sub-tasks: Data cleaning, exploration, visualization, predictive modeling, and report generation. Input: Natural language prompts accompanied by datasets. Output: Executable code and valid final answers derived from the data. |
| DA-Code | Kang Liu | 2024 | Link Repo | Challenging benchmark for agentic code generation in data science. Sub-tasks: Complex data wrangling and analytics via code generation. Input: Natural language descriptions of domain-specific problems grounded in the provided data. Output: Executable code whose execution results are checked against ground truth. |
| DSEval | Kan Ren | 2024 | Link Repo | Evaluation paradigm covering the full data science lifecycle via bootstrapped annotation. Sub-tasks: Problem definition, data processing, and modeling across datasets like Kaggle or LeetCode. Input: Sequences of interdependent data science problems. Output: Code solutions and execution results for each iteration. |
| WebDS | Christopher D. Manning | 2025 | Link | End-to-end web-based benchmark reflecting real-world analytics workflows. Sub-tasks: Autonomous web browsing, data acquisition, cleaning, analysis, and visualization. Input: High-level analytical goals with access to 29 diverse websites. Output: Summarized analyses and insights. |
| PredictiQ | Xiaojun Ma | 2025 | Link Repo | Benchmark specialized in predictive analysis capabilities across diverse fields. Sub-tasks: Text analysis and code generation for prediction tasks. Input: Sophisticated predictive queries paired with real-world datasets. Output: Prediction results with verified text-code alignment. |
| InsightBench | Issam Hadj Laradji | 2025 | Paper Repo | Benchmark for end-to-end business analytics and insight discovery. Sub-tasks: Formulating questions, interpreting answers, and summarizing findings. Input: Business datasets with high-level analytic goals. Output: Summary of discovered insights and actionable steps. |
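Several of the benchmarks above (e.g., InfiAgent-DABench, DABStep) grade agents by exact-matching closed-form answers against canonical labels. A minimal sketch of such a checker, assuming a simple per-question answer dictionary (the function names and answer formats here are illustrative, not taken from any benchmark's actual harness):

```python
def normalize(value: str):
    """Coerce an answer string to a canonical, comparable form."""
    v = value.strip().lower()
    try:
        # Numeric answers: strip thousands separators, round for tolerant compare.
        return round(float(v.replace(",", "")), 4)
    except ValueError:
        # Categorical/string answers: case-insensitive compare.
        return v

def score(predictions: dict, labels: dict) -> float:
    """Fraction of questions whose predicted answer matches the gold label."""
    correct = sum(
        1 for qid, gold in labels.items()
        if qid in predictions and normalize(predictions[qid]) == normalize(gold)
    )
    return correct / len(labels)
```

Real harnesses differ in details (per-type tolerances, list-valued answers, partial credit), but the pattern of normalize-then-exact-match is what makes these benchmarks automatically verifiable.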
| Benchmark Name | Task Type | Year | Source | TLDR |
|---|---|---|---|---|
| MDBench | Reasoning | 2025 | Link | MDBench introduces a new multi-document reasoning benchmark synthetically generated through knowledge-guided prompting. |
| MMQA | Reasoning | 2025 | Link Repo | MMQA is a multi-table multi-hop question answering dataset with 3,312 tables across 138 domains, evaluating LLMs' capabilities in multi-table retrieval, Text-to-SQL, Table QA, and primary/foreign key selection. |
| ToRR | Reasoning | 2025 | Link Repo | ToRR is a benchmark assessing LLMs' table reasoning and robustness across 10 datasets with diverse table serializations and perturbations, revealing models' brittleness to format variations. |
| MMTU | Comprehensive | 2025 | Link Repo | MMTU is a massive multi-task table understanding and reasoning benchmark with over 30K questions across 25 real-world table tasks, designed to evaluate models' ability to understand, reason, and manipulate tables. |
| RADAR | Reasoning | 2025 | Link Repo | RADAR is a benchmark for evaluating language models' data-aware reasoning on imperfect tabular data, covering 5 common artifact types (e.g., outlier values, inconsistent formats). Perturbations are constructed so that naive computation over the raw table yields a wrong answer, forcing the model to detect and handle the artifacts to reach the correct result. |
| Spider2 | Text2SQL | 2025 | Link Repo | Evaluation framework for real-world enterprise text-to-SQL workflows. Sub-tasks: Interacting with complex SQL environments (BigQuery, Snowflake), handling diverse operations, and processing long contexts. Input: Natural language questions with enterprise-level database schemas. Output: Complex SQL queries (often >100 lines) to solve the workflow. |
| DataBench | Reasoning | 2024 | Link Repo | Benchmark for Question Answering over Tabular Data assessing semantic reasoning. Sub-tasks: Answering questions requiring numerical, boolean, or categorical reasoning over diverse datasets. Input: Natural language questions paired with 65 real-world tabular datasets (CSV/Parquet). Output: Exact answer values (Boolean, Number, Category, or List) derived from the table. |
| TableBench | Reasoning | 2024 | Link Repo | Comprehensive benchmark for Table QA covering 18 fields and 4 complexity categories. Sub-tasks: Fact-checking, numerical reasoning, data analysis, and visualization. Input: Natural language questions paired with tables emphasizing numerical data. Output: Final answers derived through complex reasoning steps (e.g., Chain-of-Thought). |
| TQA-Bench | Reasoning | 2024 | Link Repo | Benchmark for evaluating LLMs on multi-table question answering with scalable context. Sub-tasks: Reasoning across multiple interconnected tables and handling long-context serialization (up to 64k tokens). Input: Natural language questions paired with multi-table relational databases. Output: Answers or SQL queries derived from joining and analyzing multiple tables. |
| SpreadsheetBench | Reasoning | 2024 | Link Repo | Benchmark for spreadsheet manipulation derived exclusively from real-world scenarios. Sub-tasks: Find, extract, sum, highlight, remove, modify, count, delete, calculate, and display. Input: Real-world user instructions from Excel forums paired with complex spreadsheet files. Output: Modified spreadsheet files or specific values matching the instruction. |
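Benchmarks like DataBench and TQA-Bench ultimately compare a typed answer (Boolean, Number, Category, or List) computed over the table against a gold value, regardless of what code produced it. A self-contained sketch of that answer-typing idea, using a toy table and illustrative questions (none of this data comes from the actual datasets):

```python
# Toy table (rows of a CSV) standing in for a benchmark dataset.
rows = [
    {"city": "Oslo", "sales": 120},
    {"city": "Lima", "sales": 80},
    {"city": "Oslo", "sales": 200},
    {"city": "Pune", "sales": 150},
]

def total_sales(rows):
    """Number-typed answer: aggregate a numeric column."""
    return sum(r["sales"] for r in rows)

def any_over(rows, threshold):
    """Boolean-typed answer: existence check over a column."""
    return any(r["sales"] > threshold for r in rows)

def top_city(rows):
    """Category-typed answer: group-by + argmax over group sums."""
    totals = {}
    for r in rows:
        totals[r["city"]] = totals.get(r["city"], 0) + r["sales"]
    return max(totals, key=totals.get)
```

Because each answer has a declared type, the harness can exact-match numbers and categories rather than fuzzily comparing generated text, which is what keeps these Table QA benchmarks cheap to score.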
| Benchmark Name | Task Type | Year | Source |
|---|---|---|---|
| FinQA | Reasoning | 2021 | Link Repo |
| FeTaQA | Reasoning | 2021 | Link Repo |
| HiTab | Reasoning | 2022 | Link Repo |
| Venue | Paper | Authors | Links |
|---|---|---|---|
| SIGMOD'26 | ST-Raptor: LLM-Powered Semi-Structured Table Question Answering | Xuanhe Zhou | paper |
| Arxiv/2601 | Can We Predict Before Executing Machine Learning Agents? | Ningyu Zhang | paper |
| CIDR'25 | AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries | Guoliang Li | paper |
| CIDR'25 | Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing | Gerardo Vitagliano | paper |
| IEEE Data Eng. Bull. | iDataLake: An LLM-Powered Analytics System on Data Lakes | Guoliang Li | paper |
| VLDB'25 | Towards Automated Cross-domain Exploratory Data Analysis through Large Language Models | Qi Liu | paper |
| VLDB'25 | AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework | Ju Fan | paper |
| VLDB'25 | DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing | Eugene Wu | paper |
| VLDB'25 | Data Imputation with Limited Data Redundancy Using Data Lakes | Yuyu Luo | paper |
| VLDB'25 | Semantic Operators and Their Optimization: Enabling LLM-Based Data Processing with Accuracy Guarantees in LOTUS | Matei Zaharia | paper |
| VLDB'25 | Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation | Yuyu Luo, Ju Fan, Nan Tang, et al. | paper |
| SIGMOD'25 | Automatic Database Configuration Debugging using Retrieval-Augmented Language Models | Nan Tang, Ju Fan, et al. | paper |
| SIGMOD'25 | Andromeda: Debugging Database Performance Issues with Retrieval-Augmented Large Language Models | Nan Tang, Ju Fan, Xiaoyong Du, et al. | paper |
| ICDE'25 | DataLab: A Unified Platform for LLM-Powered Business Intelligence | Wei Chen | paper |
| ICML'25 | AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML | Sung Ju Hwang | paper |
| ICML'25 | Compositional Condition Question Answering in Tabular Understanding | Han-Jia Ye | paper |
| ICML'25 | Are Large Language Models Ready for Multi-Turn Tabular Data Analysis? | Reynold Cheng | paper |
| ICML'25 | Agent Workflow Memory | Graham Neubig | paper |
| ICLR'25 | SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents | Qianhui Wu | paper |
| ICLR'25 | InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation | Issam Hadj Laradji | paper |
| ICLR'25 | Agent-Oriented Planning in Multi-Agent Systems | Yaliang Li | paper |
| ICLR'25 Oral | AFlow: Automating Agentic Workflow Generation | Chenglin Wu | paper |
| COLM'25 | Inducing Programmatic Skills for Agentic Tasks | Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, Daniel Fried | paper |
| NAACL'25 | H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables | Chandan K. Reddy | paper |
| ACL'25 | Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning | Peiying Yu, Jingjing Wang | paper |
| ACL'25 Findings | Data Interpreter: An LLM Agent For Data Science | Bang Liu & Chenglin Wu | paper |
| NeurIPS'25 | Table as a Modality for Large Language Models | Junbo Zhao | paper |
| Arxiv/2512 | Beyond Sliding Windows: Learning to Manage Memory in Non-Markovian Environments | Tim Klinger | paper |
| Arxiv/2512 | MemR3: Memory Retrieval via Reflective Reasoning for LLM Agents | Song Le | paper |
| Arxiv/2512 | Learning Hierarchical Procedural Memory for LLM Agents through Bayesian Selection and Contrastive Refinement | Mahdi Jalili | paper |
| Arxiv/2512 | Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution | Zouying Cao, Zhaoyang Liu, Bolin Ding, Hai Zhao | paper |
| Arxiv/2510 | AgentFold: Long-Horizon Web Agents with Proactive Context Management | Rui Ye, Siheng Chen | paper |
| Arxiv/2510 | Scaling Long-Horizon LLM Agent via Context-Folding | Weiwei Sun | paper |
| Arxiv/2510 | LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation | Dongge Han | paper |
| Arxiv/2510 | TOOLMEM: Enhancing Multimodal Agents with Learnable Tool Capability Memory | Zora Zhiruo Wang | paper |
| Arxiv/2510 | Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks | Jitao Sang | paper |
| Arxiv/2510 | LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science | Tomas Pfister | paper |
| Arxiv/2510 | DeepAnalyze: Agentic Large Language Models for Autonomous Data Science | Ju Fan | paper |
| Arxiv/2510 | Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models | Qizheng Zhang & Changran Hu | paper |
| Arxiv/2509 | Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences | Andrew Kyle Lampinen | paper |
| Arxiv/2509 | H2R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents | Chengdong Xu | paper |
| Arxiv/2509 | Mem-α: Learning Memory Construction via Reinforcement Learning | Yu Wang | paper |
| Arxiv/2509 | ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory | Siru Ouyang | paper |
| Arxiv/2509 | SGMem: Sentence Graph Memory for Long-Term Conversational Agents | Yaxiong Wu | paper |
| Arxiv/2509 | TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning | Qi Liu | paper |
| Arxiv/2509 | MemGen: Weaving Generative Latent Memory for Self-Evolving Agents | Shuicheng Yan | paper |
| Arxiv/2508 | Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models | Zhouhan Lin | paper |
| Arxiv/2508 | Memory-R1: Enhancing Large Language Model Agents to Actively Manage and Utilize External Memory | Yunpu Ma | paper |
| Arxiv/2508 | Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems | Guoliang Li | paper |
| Arxiv/2508 | Memp: Exploring Agent Procedural Memory | Ningyu Zhang | paper |
| Arxiv/2508 | Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory | Wei Li | paper |
| Arxiv/2508 | AgenticData: An Agentic Data Analytics System for Heterogeneous Data | Yuan Li | paper |
| Arxiv/2508 | Multiple Memory Systems for Enhancing the Long-term Memory of Agent | Bo Wang | paper |
| Arxiv/2507 | H-MEM: Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents | Shaoning Zeng | paper |
| Arxiv/2507 | MemOS: A Memory OS for AI System | Siheng Chen, Wentao Zhang, Zhi-Qin John Xu, Feiyu Xiong | paper |
| Arxiv/2507 | MIRIX: Multi-Agent Memory System for LLM-Based Agents | Xi Chen | paper |
| Arxiv/2507 | MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent | Hao Zhou | paper |
| Arxiv/2506 | MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning | Thuy-Trang Vu | paper |
| Arxiv/2506 | Memory OS of AI Agent | Ting Bai | paper |
| Arxiv/2506 | G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems | Shuicheng Yan | paper |
| Arxiv/2506 | AutoMind: Adaptive Knowledgeable Agent for Automated Data Science | Ningyu Zhang | paper |
| Arxiv/2505 | DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation | Patrick Ng | paper |
| Arxiv/2505 | TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes | Ju Fan | paper |
| Arxiv/2505 | How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior | Zidi Xiong | paper |
| Arxiv/2505 | Weaver: Interweaving SQL and LLM for Table Reasoning | Vivek Gupta | paper |
| Arxiv/2504 | AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery | Issam H. Laradji | paper |
| Arxiv/2504 | Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory | Deshraj Yadav | paper |
| Arxiv/2503 | R3Mem: Bridging Memory Retention and Retrieval via Reversible Compression | Yun Zhu | paper |
| Arxiv/2503 | DatawiseAgent: A Notebook-Centric LLM Agent Framework for Automated Data Science | Yu Huang | paper |
| Arxiv/2503 | DAgent: A Relational Database-Driven Data Analysis Report Generation Agent | Yunjun Gao | paper |
| Arxiv/2502 | A-MEM: Agentic Memory for LLM Agents | Yongfeng Zhang | paper |
| Arxiv/2501 | TableMaster: A Recipe to Advance Table Understanding with Language Models | Hanbing Liu | paper |
| Arxiv/2501 | ChartInsighter: An Approach for Mitigating Hallucination in Time-series Chart Summary Generation with A Benchmark Dataset | Fen Wang, Siming Chen | paper |
| Arxiv/2410 | AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competition | Wenhao Huang & Ge Zhang | paper |
| Arxiv/2407 | LAMBDA: A Large Model Based Data Agent | Yancheng Yuan & Jian Huang | paper |
| ICML'24 | DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning | Hechang Chen | paper |
| ICLR'24 | Synapse: Trajectory-as-exemplar prompting with memory for computer control | Bo An | paper |
| ICLR'24 | OpenTab: Advancing Large Language Models as Open-domain Table Reasoners | Jiani Zhang | paper |
| ICLR'24 | CABINET: Content Relevance based Noise Reduction for Table Question Answering | Balaji Krishnamurthy | paper |
| ICLR'24 | Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | Hannaneh Hajishirzi | paper |
| ICLR'24 | Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding | Tomas Pfister | paper |
| ICLR'24 | MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework | Chenglin Wu | paper |
| AAAI'24 | ExpeL: LLM Agents Are Experiential Learners | Gao Huang | paper |
| VLDB'24 | ReAcTable: Enhancing ReAct for Table Question Answering | Jignesh M. Patel | paper |
| VLDB'24 | AutoTQA: Towards Autonomous Tabular Question Answering through Multi-Agent Large Language Models | Qi Liu | paper |
| VLDB'24 | D-Bot: Database Diagnosis System using Large Language Models | Guoliang Li | paper |
| NeurIPS'23 | Augmenting language models with long-term memory | Furu Wei | paper |
| NeurIPS'23 | Reflexion: Language Agents with Verbal Reinforcement Learning | Shunyu Yao | paper |
| IEEE VIS'23 | What Exactly is an Insight? A Literature Review | Alvitta Ottley | paper |
| SIGIR'23 | Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning (Dater) | Yongbin Li | paper |
| Arxiv/2310 | MemGPT: Towards LLMs as Operating Systems | Charles Packer | paper |
A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
A Survey on Large Language Model-based Agents for Statistics and Data Science