| Benchmark Name | Corresp. Author | Year | Source | TLDR |
|---|---|---|---|---|
| InfiAgent-DABench | Jingjing Xu | 2024 | Link Repo | ICML'24 benchmark for evaluating agents on closed-form data analysis tasks. Sub-tasks: Data querying, processing, and reasoning over CSV files using Python. Input: Natural language questions paired with CSV files. Output: Closed-form answers (e.g., specific values or strings) matched against canonical labels. |
| FDABench | Gao Cong | 2025 | Link Repo | Benchmark for agents in multi-source analytical scenarios. Sub-tasks: Single-source structured analysis, unstructured data retrieval, and cross-source heterogeneous data fusion. Input: Analytical queries spanning structured (DB) and unstructured (PDF/Audio) sources. Output: SQL queries, precise values, or comprehensive analytical reports. Metrics: Exact-match for MC/checkbox tasks; text tasks by overlap metrics (e.g., ROUGE) plus tool-use success/recall; also reports efficiency (latency, model/tool calls, token/cost). |
| DSBench | Dong Yu | 2025 | Link Repo | ICLR'25 benchmark evaluating data science agents on realistic, complex tasks. Sub-tasks: Data cleaning, transformation, visualization, and modeling using libraries like Pandas/Scikit-learn. Input: Long-horizon problem descriptions with access to live data stacks. Output: Executable code and final artifacts (charts, tables, or answers). Metrics: Per-task 0–5 rubric with penalties for extra user interaction, corrections, or hallucinations; reports overall score and hallucination rate. |
| DABStep | Thomas Wolf | 2025 | Link Repo | Benchmark focusing on iterative multi-step reasoning in financial analytics. Sub-tasks: Code-based data manipulation and cross-referencing structured tables with unstructured documentation. Input: Complex queries requiring multi-step navigation of heterogeneous data. Output: Factoid-style answers (string/numeric) verifiable by automatic scoring. |
| DataSciBench | Yisong Yue | 2025 | Link Repo | Comprehensive benchmark for evaluating data science agents under uncertain ground truth, built on the Task-Function-Code (TFC) framework. Sub-tasks: Data cleaning, exploration, visualization, predictive modeling, and report generation. Input: Natural language prompts accompanied by datasets. Output: Executable code and valid final answers derived from the data. |
| DA-Code | Kang Liu | 2024 | Link Repo | Challenging benchmark for agentic code generation in data science. Sub-tasks: Complex data wrangling and analytics via code generation. Input: Natural language descriptions of domain-specific problems grounded in the provided data. Output: Executable code whose execution results are checked against ground truth. |
| DSEval | Kan Ren | 2024 | Link Repo | Evaluation paradigm covering the full data science lifecycle via bootstrapped annotation. Sub-tasks: Problem definition, data processing, and modeling across datasets like Kaggle or LeetCode. Input: Sequences of interdependent data science problems. Output: Code solutions and execution results for each iteration. |
| WebDS | Christopher D. Manning | 2025 | Link | End-to-end web-based benchmark reflecting real-world analytics workflows. Sub-tasks: Autonomous web browsing, data acquisition, cleaning, analysis, and visualization. Input: High-level analytical goals with access to 29 diverse websites. Output: Summarized analyses and insights. |
| PredictiQ | Xiaojun Ma | 2025 | Link Repo | Benchmark specialized in predictive analysis capabilities across diverse fields. Sub-tasks: Text analysis and code generation for prediction tasks. Input: Sophisticated predictive queries paired with real-world datasets. Output: Prediction results with verified text-code alignment. |
| InsightBench | Issam Hadj Laradji | 2025 | Paper Repo | Benchmark for end-to-end business analytics and insight discovery. Sub-tasks: Formulating questions, interpreting answers, and summarizing findings. Input: Business datasets with high-level analytic goals. Output: Summary of discovered insights and actionable steps. |
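Several of the benchmarks above (e.g., InfiAgent-DABench, DABStep) grade agents by exact-matching closed-form answers against canonical labels. A minimal sketch of such a checker, assuming a simple per-question answer dictionary (the function names and answer formats here are illustrative, not taken from any benchmark's actual harness):

```python
def normalize(value: str):
    """Coerce an answer string to a canonical, comparable form."""
    v = value.strip().lower()
    try:
        # Numeric answers: strip thousands separators, round for tolerant compare.
        return round(float(v.replace(",", "")), 4)
    except ValueError:
        # Categorical/string answers: case-insensitive compare.
        return v

def score(predictions: dict, labels: dict) -> float:
    """Fraction of questions whose predicted answer matches the gold label."""
    correct = sum(
        1 for qid, gold in labels.items()
        if qid in predictions and normalize(predictions[qid]) == normalize(gold)
    )
    return correct / len(labels)
```

Real harnesses differ in details (per-type tolerances, list-valued answers, partial credit), but the pattern of normalize-then-exact-match is what makes these benchmarks automatically verifiable.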
| Benchmark Name | Task Type | Year | Source | TLDR |
|---|---|---|---|---|
| MDBench | Reasoning | 2025 | Link | MDBench introduces a new multi-document reasoning benchmark synthetically generated through knowledge-guided prompting. |
| MMQA | Reasoning | 2025 | Link Repo | MMQA is a multi-table multi-hop question answering dataset with 3,312 tables across 138 domains, evaluating LLMs' capabilities in multi-table retrieval, Text-to-SQL, Table QA, and primary/foreign key selection. |
| ToRR | Reasoning | 2025 | Link Repo | ToRR is a benchmark assessing LLMs' table reasoning and robustness across 10 datasets with diverse table serializations and perturbations, revealing models' brittleness to format variations. |
| MMTU | Comprehensive | 2025 | Link Repo | MMTU is a massive multi-task table understanding and reasoning benchmark with over 30K questions across 25 real-world table tasks, designed to evaluate models' ability to understand, reason, and manipulate tables. |
| RADAR | Reasoning | 2025 | Link Repo | RADAR is a benchmark for evaluating language models' data-aware reasoning on imperfect tabular data, covering 5 common artifact types (e.g., outlier values, inconsistent formats). Perturbations are constructed so that naive computation over the raw table yields a wrong answer, forcing the model to detect and handle the artifacts to reach the correct result. |
| Spider2 | Text2SQL | 2025 | Link Repo | Evaluation framework for real-world enterprise text-to-SQL workflows. Sub-tasks: Interacting with complex SQL environments (BigQuery, Snowflake), handling diverse operations, and processing long contexts. Input: Natural language questions with enterprise-level database schemas. Output: Complex SQL queries (often >100 lines) to solve the workflow. |
| DataBench | Reasoning | 2024 | Link Repo | Benchmark for Question Answering over Tabular Data assessing semantic reasoning. Sub-tasks: Answering questions requiring numerical, boolean, or categorical reasoning over diverse datasets. Input: Natural language questions paired with 65 real-world tabular datasets (CSV/Parquet). Output: Exact answer values (Boolean, Number, Category, or List) derived from the table. |
| TableBench | Reasoning | 2024 | Link Repo | Comprehensive benchmark for Table QA covering 18 fields and 4 complexity categories. Sub-tasks: Fact-checking, numerical reasoning, data analysis, and visualization. Input: Natural language questions paired with tables emphasizing numerical data. Output: Final answers derived through complex reasoning steps (e.g., Chain-of-Thought). |
| TQA-Bench | Reasoning | 2024 | Link Repo | Benchmark for evaluating LLMs on multi-table question answering with scalable context. Sub-tasks: Reasoning across multiple interconnected tables and handling long-context serialization (up to 64k tokens). Input: Natural language questions paired with multi-table relational databases. Output: Answers or SQL queries derived from joining and analyzing multiple tables. |
| SpreadsheetBench | Reasoning | 2024 | Link Repo | Benchmark for spreadsheet manipulation derived exclusively from real-world scenarios. Sub-tasks: Find, extract, sum, highlight, remove, modify, count, delete, calculate, and display. Input: Real-world user instructions from Excel forums paired with complex spreadsheet files. Output: Modified spreadsheet files or specific values matching the instruction. |
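Benchmarks like DataBench and TQA-Bench ultimately compare a typed answer (Boolean, Number, Category, or List) computed over the table against a gold value, regardless of what code produced it. A self-contained sketch of that answer-typing idea, using a toy table and illustrative questions (none of this data comes from the actual datasets):

```python
# Toy table (rows of a CSV) standing in for a benchmark dataset.
rows = [
    {"city": "Oslo", "sales": 120},
    {"city": "Lima", "sales": 80},
    {"city": "Oslo", "sales": 200},
    {"city": "Pune", "sales": 150},
]

def total_sales(rows):
    """Number-typed answer: aggregate a numeric column."""
    return sum(r["sales"] for r in rows)

def any_over(rows, threshold):
    """Boolean-typed answer: existence check over a column."""
    return any(r["sales"] > threshold for r in rows)

def top_city(rows):
    """Category-typed answer: group-by + argmax over group sums."""
    totals = {}
    for r in rows:
        totals[r["city"]] = totals.get(r["city"], 0) + r["sales"]
    return max(totals, key=totals.get)
```

Because each answer has a declared type, the harness can exact-match numbers and categories rather than fuzzily comparing generated text, which is what keeps these Table QA benchmarks cheap to score.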
| Benchmark Name | Task Type | Year | Source |
|---|---|---|---|
| FinQA | Reasoning | 2021 | Link Repo |
| FeTaQA | Reasoning | 2021 | Link Repo |
| HiTab | Reasoning | 2022 | Link Repo |
| Venue | Paper | Authors | Links |
|---|---|---|---|
| SIGMOD'26 | ST-Raptor: LLM-Powered Semi-Structured Table Question Answering | Xuanhe Zhou | paper |
| Arxiv/2601 | Can We Predict Before Executing Machine Learning Agents? | Ningyu Zhang | paper |
| CIDR'25 | AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries | Guoliang Li | paper |
| CIDR'25 | Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing | Gerardo Vitagliano | paper |
| IEEE Data Eng. Bull. | iDataLake: An LLM-Powered Analytics System on Data Lakes | Guoliang Li | paper |
| VLDB'25 | Towards Automated Cross-domain Exploratory Data Analysis through Large Language Models | Qi Liu | paper |
| VLDB'25 | AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework | Ju Fan | paper |
| VLDB'25 | DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing | Eugene Wu | paper |
| VLDB'25 | Data Imputation with Limited Data Redundancy Using Data Lakes | Yuyu Luo | paper |
| VLDB'25 | Semantic Operators and Their Optimization: Enabling LLM-Based Data Processing with Accuracy Guarantees in LOTUS | Matei Zaharia | paper |
| VLDB'25 | Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation | Yuyu Luo, Ju Fan, Nan Tang, et al. | paper |
| SIGMOD'25 | Automatic Database Configuration Debugging using Retrieval-Augmented Language Models | Nan Tang, Ju Fan, et al. | paper |
| SIGMOD'25 | Andromeda: Debugging Database Performance Issues with Retrieval-Augmented Large Language Models | Nan Tang, Ju Fan, Xiaoyong Du, et al. | paper |
| ICDE'25 | DataLab: A Unified Platform for LLM-Powered Business Intelligence | Wei Chen | paper |
| ICML'25 | AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML | Sung Ju Hwang | paper |
| ICML'25 | Compositional Condition Question Answering in Tabular Understanding | Han-Jia Ye | paper |
| ICML'25 | Are Large Language Models Ready for Multi-Turn Tabular Data Analysis? | Reynold Cheng | paper |
| ICML'25 | Agent Workflow Memory | Graham Neubig | paper |
| ICLR'25 | SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents | Qianhui Wu | paper |
| ICLR'25 | InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation | Issam Hadj Laradji | paper |
| ICLR'25 | Agent-Oriented Planning in Multi-Agent Systems | Yaliang Li | paper |
| ICLR'25 Oral | AFlow: Automating Agentic Workflow Generation | Chenglin Wu | paper |
| COLM'25 | Inducing Programmatic Skills for Agentic Tasks | Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, Daniel Fried | paper |
| NAACL'25 | H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables | Chandan K. Reddy | paper |
| ACL'25 | Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning | Peiying Yu, Jingjing Wang | paper |
| ACL'25 Findings | Data Interpreter: An LLM Agent For Data Science | Bang Liu & Chenglin Wu | paper |
| NeurIPS'25 | Table as a Modality for Large Language Models | Junbo Zhao | paper |
| Arxiv/2512 | Beyond Sliding Windows: Learning to Manage Memory in Non-Markovian Environments | Tim Klinger | paper |
| Arxiv/2512 | MemR3: Memory Retrieval via Reflective Reasoning for LLM Agents | Song Le | paper |
| Arxiv/2512 | Learning Hierarchical Procedural Memory for LLM Agents through Bayesian Selection and Contrastive Refinement | Mahdi Jalili | paper |
| Arxiv/2512 | Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution | Zouying Cao, Zhaoyang Liu, Bolin Ding, Hai Zhao | paper |
| Arxiv/2510 | AgentFold: Long-Horizon Web Agents with Proactive Context Management | Rui Ye, Siheng Chen | paper |
| Arxiv/2510 | Scaling Long-Horizon LLM Agent via Context-Folding | Weiwei Sun | paper |
| Arxiv/2510 | LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation | Dongge Han | paper |
| Arxiv/2510 | TOOLMEM: Enhancing Multimodal Agents with Learnable Tool Capability Memory | Zora Zhiruo Wang | paper |
| Arxiv/2510 | Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks | Jitao Sang | paper |
| Arxiv/2510 | LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science | Tomas Pfister | paper |
| Arxiv/2510 | DeepAnalyze: Agentic Large Language Models for Autonomous Data Science | Ju Fan | paper |
| Arxiv/2510 | Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models | Qizheng Zhang & Changran Hu | paper |
| Arxiv/2509 | Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences | Andrew Kyle Lampinen | paper |
| Arxiv/2509 | H2R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents | Chengdong Xu | paper |
| Arxiv/2509 | Mem-α: Learning Memory Construction via Reinforcement Learning | Yu Wang | paper |
| Arxiv/2509 | ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory | Siru Ouyang | paper |
| Arxiv/2509 | SGMem: Sentence Graph Memory for Long-Term Conversational Agents | Yaxiong Wu | paper |
| Arxiv/2509 | TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning | Qi Liu | paper |
| Arxiv/2509 | MemGen: Weaving Generative Latent Memory for Self-Evolving Agents | Shuicheng Yan | paper |
| Arxiv/2508 | Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models | Zhouhan Lin | paper |
| Arxiv/2508 | Memory-R1: Enhancing Large Language Model Agents to Actively Manage and Utilize External Memory | Yunpu Ma | paper |
| Arxiv/2508 | Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems | Guoliang Li | paper |
| Arxiv/2508 | Memp: Exploring Agent Procedural Memory | Ningyu Zhang | paper |
| Arxiv/2508 | Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory | Wei Li | paper |
| Arxiv/2508 | AgenticData: An Agentic Data Analytics System for Heterogeneous Data | Yuan Li | paper |
| Arxiv/2508 | Multiple Memory Systems for Enhancing the Long-term Memory of Agent | Bo Wang | paper |
| Arxiv/2507 | H-MEM: Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents | Shaoning Zeng | paper |
| Arxiv/2507 | MemOS: A Memory OS for AI System | Siheng Chen, Wentao Zhang, Zhi-Qin John Xu, Feiyu Xiong | paper |
| Arxiv/2507 | MIRIX: Multi-Agent Memory System for LLM-Based Agents | Xi Chen | paper |
| Arxiv/2507 | MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent | Hao Zhou | paper |
| Arxiv/2506 | MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning | Thuy-Trang Vu | paper |
| Arxiv/2506 | Memory OS of AI Agent | Ting Bai | paper |
| Arxiv/2506 | G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems | Shuicheng Yan | paper |
| Arxiv/2506 | AutoMind: Adaptive Knowledgeable Agent for Automated Data Science | Ningyu Zhang | paper |
| Arxiv/2505 | DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation | Patrick Ng | paper |
| Arxiv/2505 | TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes | Ju Fan | paper |
| Arxiv/2505 | How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior | Zidi Xiong | paper |
| Arxiv/2505 | Weaver: Interweaving SQL and LLM for Table Reasoning | Vivek Gupta | paper |
| Arxiv/2504 | AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery | Issam H. Laradji | paper |
| Arxiv/2504 | Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory | Deshraj Yadav | paper |
| Arxiv/2503 | R3Mem: Bridging Memory Retention and Retrieval via Reversible Compression | Yun Zhu | paper |
| Arxiv/2503 | DatawiseAgent: A Notebook-Centric LLM Agent Framework for Automated Data Science | Yu Huang | paper |
| Arxiv/2503 | DAgent: A Relational Database-Driven Data Analysis Report Generation Agent | Yunjun Gao | paper |
| Arxiv/2502 | A-MEM: Agentic Memory for LLM Agents | Yongfeng Zhang | paper |
| Arxiv/2501 | TableMaster: A Recipe to Advance Table Understanding with Language Models | Hanbing Liu | paper |
| Arxiv/2501 | ChartInsighter: An Approach for Mitigating Hallucination in Time-series Chart Summary Generation with A Benchmark Dataset | Fen Wang, Siming Chen | paper |
| Arxiv/2410 | AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competition | Wenhao Huang & Ge Zhang | paper |
| Arxiv/2407 | LAMBDA: A Large Model Based Data Agent | Yancheng Yuan & Jian Huang | paper |
| ICML'24 | DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning | Hechang Chen | paper |
| ICLR'24 | Synapse: Trajectory-as-exemplar prompting with memory for computer control | Bo An | paper |
| ICLR'24 | OpenTab: Advancing Large Language Models as Open-domain Table Reasoners | Jiani Zhang | paper |
| ICLR'24 | CABINET: Content Relevance based Noise Reduction for Table Question Answering | Balaji Krishnamurthy | paper |
| ICLR'24 | Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection | Hannaneh Hajishirzi | paper |
| ICLR'24 | Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding | Tomas Pfister | paper |
| ICLR'24 | MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework | Chenglin Wu | paper |
| AAAI'24 | ExpeL: LLM Agents Are Experiential Learners | Gao Huang | paper |
| VLDB'24 | ReAcTable: Enhancing ReAct for Table Question Answering | Jignesh M. Patel | paper |
| VLDB'24 | AutoTQA: Towards Autonomous Tabular Question Answering through Multi-Agent Large Language Models | Qi Liu | paper |
| VLDB'24 | D-Bot: Database Diagnosis System using Large Language Models | Guoliang Li | paper |
| NeurIPS'23 | Augmenting language models with long-term memory | Furu Wei | paper |
| NeurIPS'23 | Reflexion: Language Agents with Verbal Reinforcement Learning | Shunyu Yao | paper |
| IEEE VIS'23 | What Exactly is an Insight? A Literature Review | Alvitta Ottley | paper |
| SIGIR'23 | Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning (Dater) | Yongbin Li | paper |
| Arxiv/2310 | MemGPT: Towards LLMs as Operating Systems | Charles Packer | paper |
A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
A Survey on Large Language Model-based Agents for Statistics and Data Science