This repository implements LLM-Hanabi, a benchmark for evaluating rationale inference and Theory-of-Mind (ToM) capabilities of Large Language Models (LLMs) in the cooperative card game Hanabi. It assesses how well LLMs infer others' intentions (1st-order ToM) and predict others' interpretations (2nd-order ToM) in a dynamic, collaborative setting with imperfect information.
Hanabi is a cooperative game where 2–5 players build firework stacks by playing cards in order, using limited hints to convey information. This codebase simulates games with LLM-driven agents, evaluates their ToM proficiency, and analyzes correlations between ToM and game performance.
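A detail worth keeping in mind: each player sees every hand except their own, so all reasoning runs on hints and observed behavior. Below is a minimal sketch of the observation this implies (field names are illustrative, not the actual `HanabiEnv.py` interface; standard rules grant 8 hint and 3 life tokens):

```python
from dataclasses import dataclass, field

@dataclass
class PublicHanabiState:
    """Illustrative per-agent observation; field names are hypothetical,
    not the real HanabiEnv.py interface."""
    fireworks: dict = field(default_factory=lambda: {c: 0 for c in "RYGWB"})  # top rank per color stack
    hint_tokens: int = 8   # giving a hint spends one; discarding restores one
    life_tokens: int = 3   # a misplayed card costs one; zero ends the game
    teammates_hands: dict = field(default_factory=dict)  # fully visible to this agent
    own_hand_hints: list = field(default_factory=list)   # only hinted facts about own hidden cards
```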
- Game Simulation: Supports 2–5 players with configurable tokens and AI strategies (e.g., Chain-of-Thought, Adaptive Behavior Design).
- ToM Evaluation: Scores 1st-order (0–10) and 2nd-order (0–5) ToM based on agents' rationales and actions.
- Correlation Analysis: Computes Pearson correlations between ToM scores and game scores (a minimal sketch follows this list).
- Logging: Saves game logs, ToM records, and summaries in JSON/CSV formats.
- Scalability: Uses multiprocessing for parallel game simulations.
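The correlation analysis mentioned above is a plain Pearson computation; here is a minimal sketch with invented per-game scores (the repository's own implementation lives in `hanabi.py`, so this is illustrative only):

```python
from scipy.stats import pearsonr

# Toy per-game averages, invented for illustration.
tom1_scores = [6.2, 7.8, 5.1, 8.4, 6.9]   # 1st-order ToM, 0-10 scale
game_scores = [11, 17, 9, 19, 14]         # Hanabi game score (max 25)

r, p = pearsonr(tom1_scores, game_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```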
- `hanabi.py`: Main script to run game simulations, manage player groups, and compute ToM and game score correlations.
- `HanabiEnv.py`: Implements the Hanabi game environment, providing interfaces for different agents to interact with the game.
- `Agents.py`: Defines LLM-driven agent classes (`LLMsAgent`, `Basic_LLMsAgent`, `CoT_LLMsAgent`, `ABD_LLMsAgent`). New agent types can be added here (see the sketch below).
- `ToM_eval.py`: Evaluates 1st-order and 2nd-order ToM scores based on agents' rationales and actions.
- `call_api.py`: Handles API calls to LLM providers (e.g., OpenRouter).
- `players_groups.yaml`: Configures player groups, including models, strategies, and parameters. The `ABD` strategy enables ToM-based reasoning and scoring.
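Adding a new agent means subclassing the existing hierarchy in `Agents.py`. A minimal sketch, assuming the base class exposes prompt-building, LLM-calling, and parsing helpers (all method names here are hypothetical; mirror the hooks that `Basic_LLMsAgent`, `CoT_LLMsAgent`, and `ABD_LLMsAgent` actually override):

```python
from Agents import LLMsAgent  # base class shipped with this repo


class Greedy_LLMsAgent(LLMsAgent):
    """Hypothetical agent: reuses the base prompting pipeline,
    swaps in a custom decision step."""

    def choose_action(self, observation):
        # choose_action / build_prompt / call_llm / parse_action are
        # assumed names; match the real signatures in Agents.py.
        prompt = self.build_prompt(observation)
        response = self.call_llm(prompt)
        return self.parse_action(response)
```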
- Clone Repository:

```bash
git clone [email protected]:HKUST-Knowcomp/LLM-Hanabi.git
cd ToMHanabi
```

- Install Dependencies: Install the required Python packages using the provided `requirements.txt`:

```bash
pip install -r requirements.txt
```
- Configure Environment:
  - API Keys: Update `call_api.py` with your API tokens to enable LLM interactions.
  - Player Settings: Modify `players_groups.yaml` to configure `players`, `count`, or `temperature`; an illustrative entry follows below.
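For Player Settings, here is an illustrative `players_groups.yaml` entry. The keys `players`, `count`, and `temperature` come from the step above, while the surrounding structure and the model/strategy fields are guesses, so verify against the file shipped with the repo:

```yaml
# Hypothetical schema; compare with the repo's players_groups.yaml.
Single_model_group:
  count: 3                      # number of players at the table
  players:
    - model: "openai/gpt-4o"    # any model id your call_api.py provider accepts
      strategy: "ABD"           # ABD enables ToM-based reasoning and scoring
      temperature: 0.7
```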
- Run Simulations: Execute the main script to run games with your desired configuration:

```bash
python hanabi.py --group Single_model_group --game_name LLM-Hanabi --batch 30 --num_processes 15
```

  - `--group`: Specify the player group from `players_groups.yaml` (e.g., `Single_model_group`).
  - `--game_name`: Set a custom name for the game (e.g., `LLM-Hanabi`).
  - `--batch`: Number of games to simulate (e.g., `30`).
  - `--num_processes`: Number of parallel processes (e.g., `15`; adjust based on your system's capabilities).
  - `--log`: Add this flag to record detailed model responses in `<num_players>_players-<game_name>-log.json`.
Results are stored in the `game_log/` folder:
- `<num_players>_players-<game_name>-record.csv`: Detailed scores for each game (game score, ToM1, ToM2, rounds).
- `<num_players>_players-<game_name>-summary.json`: Game configuration and summary statistics (average score, std, highest/lowest scores, ToM scores, correlations).
- `ToM_record.json`: ToM scores and rationales for each game.
- `<num_players>_players-<game_name>-log.json`: Detailed model responses for each game (if `--log` is enabled).
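To inspect a run programmatically, a small sketch that loads the per-game record; the filename instantiates the pattern above for a 2-player run, and the column names are assumptions based on the description, so check the CSV header:

```python
import pandas as pd

# Filename follows the <num_players>_players-<game_name>-record.csv pattern;
# adjust it to match your own run.
df = pd.read_csv("game_log/2_players-LLM-Hanabi-record.csv")

# Column names ("game_score", "ToM1", "ToM2") are assumed; verify against the header.
print(df[["game_score", "ToM1", "ToM2"]].describe())
```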
