[EMNLP 2025 Wordplay] LLM-Hanabi: Evaluating Multi-Agent Gameplays with Theory-of-Mind and Rationale Inference in Imperfect Information Collaboration Game

HKUST-KnowComp/LLM-Hanabi

LLM-Hanabi: A Benchmark for Theory-of-Mind in Multi-Agent Collaboration

This repository implements LLM-Hanabi, a benchmark for evaluating rationale inference and Theory-of-Mind (ToM) capabilities of Large Language Models (LLMs) in the cooperative card game Hanabi. It assesses how well LLMs infer others' intentions (1st-order ToM) and predict others' interpretations (2nd-order ToM) in a dynamic, collaborative setting with imperfect information.

Overview

Hanabi is a cooperative game where 2–5 players build firework stacks by playing cards in order, using limited hints to convey information. This codebase simulates games with LLM-driven agents, evaluates their ToM proficiency, and analyzes correlations between ToM and game performance.

[Figure: LLM-Hanabi Workflow]

Key Features

  • Game Simulation: Supports 2–5 players with configurable tokens and AI strategies (e.g., Chain-of-Thought, Adaptive Behavior Design).
  • ToM Evaluation: Scores 1st-order (0–10) and 2nd-order (0–5) ToM based on agents' rationales and actions.
  • Correlation Analysis: Computes Pearson correlations between ToM scores and game scores.
  • Logging: Saves game logs, ToM records, and summaries in JSON/CSV formats.
  • Scalability: Uses multiprocessing for parallel game simulations.
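The correlation analysis listed above can be illustrated with a small, self-contained sketch. It is not the code in hanabi.py; the function name `pearson` and the sample values are assumptions made for illustration only. It computes the Pearson coefficient between per-game ToM scores and game scores.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one series is constant
    return cov / math.sqrt(var_x * var_y)

# Made-up per-game values for illustration (not benchmark results).
tom1_scores = [2.0, 4.0, 6.0, 8.0]      # hypothetical 1st-order ToM scores
game_scores = [5.0, 9.0, 13.0, 17.0]    # hypothetical game scores
r = pearson(tom1_scores, game_scores)
print(f"Pearson r = {r:.3f}")
```

Because the Pearson coefficient is invariant to linear rescaling, the different ranges of the 1st-order (0–10) and 2nd-order (0–5) scores do not need to be normalized before comparing each with game score.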

Repository Structure

  • hanabi.py: Main script to run game simulations, manage player groups, and compute ToM and game score correlations.
  • HanabiEnv.py: Implements the Hanabi game environment, providing interfaces for different agents to interact with the game.
  • Agents.py: Defines LLM-driven agent classes (LLMsAgent, Basic_LLMsAgent, CoT_LLMsAgent, ABD_LLMsAgent). New agent types can be added here.
  • ToM_eval.py: Evaluates 1st-order and 2nd-order ToM scores based on agents' rationales and actions.
  • call_api.py: Handles API calls to LLM providers (e.g., OpenRouter).
  • players_groups.yaml: Configures player groups, including models, strategies, and parameters. The ABD strategy enables ToM-based reasoning and scoring.

Installation

  1. Clone Repository:
git clone git@github.com:HKUST-KnowComp/LLM-Hanabi.git
cd LLM-Hanabi
  2. Install Dependencies: Install the required Python packages using the provided requirements.txt:
pip install -r requirements.txt
  3. Configure Environment:

    • API Keys: Update call_api.py with your API tokens to enable LLM interactions.
    • Player Settings: Modify players_groups.yaml to configure the players (for example, player count or temperature).
  4. Run Simulations: Execute the main script to run games with your desired configuration:

python hanabi.py --group Single_model_group --game_name LLM-Hanabi --batch 30 --num_processes 15
  • --group: Specify the player group from players_groups.yaml (e.g., Single_model_group).
  • --game_name: Set a custom name for the game (e.g., LLM-Hanabi).
  • --batch: Number of games to simulate (e.g., 30).
  • --num_processes: Number of parallel processes (e.g., 15, adjust based on your system's capabilities).
  • --log: Add this flag to record detailed model responses in <num_players>_players-<game_name>-log.json.
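For readers who want to see how these flags fit together, here is a minimal argparse sketch that mirrors the options above. It is an assumption about how such a command line could be parsed, not the parser in hanabi.py; the defaults and the function name `build_parser` are hypothetical.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Run LLM-Hanabi simulations (illustrative sketch)."
    )
    # Name of a player group defined in players_groups.yaml.
    parser.add_argument("--group", type=str, required=True)
    # Custom name used in the output file names (default is an assumption).
    parser.add_argument("--game_name", type=str, default="LLM-Hanabi")
    # Number of games to simulate (default is an assumption).
    parser.add_argument("--batch", type=int, default=1)
    # Number of parallel worker processes (default is an assumption).
    parser.add_argument("--num_processes", type=int, default=1)
    # Boolean switch: present means detailed model responses are recorded.
    parser.add_argument("--log", action="store_true")
    return parser

# Parse the same arguments as the example command above.
args = build_parser().parse_args(
    ["--group", "Single_model_group", "--game_name", "LLM-Hanabi",
     "--batch", "30", "--num_processes", "15"]
)
print(args.group, args.batch, args.num_processes, args.log)
```

Since --log is a switch rather than a valued option, omitting it leaves logging off, which matches the description of the log file being written only when the flag is present.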

Environment Configurations

To replicate this environment, use:

pip install -r requirements.txt

Outputs

Results are stored in the game_log/ folder:

  1. <num_players>_players-<game_name>-record.csv: Detailed scores for each game (game score, ToM1, ToM2, rounds).
  2. <num_players>_players-<game_name>-summary.json: Game configuration and summary statistics (average score, std, highest/lowest scores, ToM scores, correlations).
  3. ToM_record.json: ToM scores and rationales for each game.
  4. <num_players>_players-<game_name>-log.json: Detailed model responses for each game (if --log is enabled).
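The naming scheme above can be expressed as a small helper that returns the paths a run is expected to produce. This is a sketch based only on the file names listed here; the function `expected_outputs` and its parameters are hypothetical and not part of the repository.

```python
from pathlib import Path

def expected_outputs(num_players, game_name, log_enabled=False, root="game_log"):
    """Return the output paths implied by the documented naming scheme."""
    prefix = f"{num_players}_players-{game_name}"
    base = Path(root)
    paths = {
        "record": base / f"{prefix}-record.csv",      # per-game scores
        "summary": base / f"{prefix}-summary.json",   # config and summary stats
        "tom_record": base / "ToM_record.json",       # ToM scores and rationales
    }
    if log_enabled:
        # Only written when the --log flag is passed.
        paths["log"] = base / f"{prefix}-log.json"
    return paths

outputs = expected_outputs(2, "LLM-Hanabi", log_enabled=True)
for kind, path in outputs.items():
    print(f"{kind}: {path.as_posix()}")
```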
