This directory contains tools for analyzing LLM reasoning patterns and game performance data collected from the Game Reasoning Arena experiments.
Use these automated solutions:
```bash
# From the project root directory:
./run_analysis.sh
```

```bash
# From the project root directory:
python3 analysis/quick_analysis.py
```

```bash
# From the project root directory:
PYTHONPATH=. python3 analysis/run_full_analysis.py --help
```
```bash
# Examples:
python3 analysis/run_full_analysis.py                            # Default settings
python3 analysis/run_full_analysis.py --quiet                    # Less verbose
python3 analysis/run_full_analysis.py --plots-dir custom_plots   # Custom output

# Game-specific and model-specific analysis
python3 analysis/run_full_analysis.py --game hex                 # Analyze only HEX games
python3 analysis/run_full_analysis.py --model llama3             # Analyze only Llama3 models
python3 analysis/run_full_analysis.py --game hex --model llama3  # Combined filtering
python3 analysis/run_full_analysis.py --game tic_tac_toe --quiet # Quiet Tic-Tac-Toe analysis
```

These automated solutions will:
- 🔍 Auto-discover all SQLite databases in `results/`
- 🔄 Merge databases into consolidated CSV files
- 🎯 Apply filters (optional) for specific games or models
- 🧠 Analyze reasoning patterns using rule-based categorization
- 📊 Generate visualizations (plots, charts, heatmaps, word clouds)
- 📋 Create summary reports with pipeline statistics
- ⚡ Handle errors gracefully with detailed logging
Output: All results are saved to the `plots/` directory, along with detailed logs.
The analysis pipeline now supports filtering for specific games and models, allowing you to focus your analysis on particular scenarios.
```bash
# Analyze only HEX games
python3 analysis/run_full_analysis.py --game hex

# Analyze only Tic-Tac-Toe games
python3 analysis/run_full_analysis.py --game tic_tac_toe

# Analyze only Connect Four games
python3 analysis/run_full_analysis.py --game connect_four
```

```bash
# Analyze only Llama models (partial matching)
python3 analysis/run_full_analysis.py --model llama

# Analyze only GPT models
python3 analysis/run_full_analysis.py --model gpt

# Analyze a specific model variant
python3 analysis/run_full_analysis.py --model llama3-8b
```

```bash
# Analyze HEX games played by Llama models only
python3 analysis/run_full_analysis.py --game hex --model llama

# Analyze Tic-Tac-Toe games with GPT models in quiet mode
python3 analysis/run_full_analysis.py --game tic_tac_toe --model gpt --quiet
```

When filters are applied, plots are automatically organized in subdirectories:
- `plots/game_hex/` - HEX-specific analysis
- `plots/model_llama/` - Llama model-specific analysis
- `plots/game_hex_model_llama/` - Combined filtering results
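The subdirectory names follow a simple pattern built from the active filters. A hypothetical reconstruction (`plots_subdir` is not the pipeline's actual function):

```python
def plots_subdir(game=None, model=None):
    """Build the plots subdirectory path for the active game/model filters."""
    parts = []
    if game:
        parts.append(f"game_{game}")
    if model:
        parts.append(f"model_{model}")
    # With no filters, results go directly into plots/
    return "plots/" + "_".join(parts) + "/" if parts else "plots/"
```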
This makes it easy to focus on specific research questions without processing all data.
When analyzing data from many models, aggregate plots can become cluttered and hard to interpret. The analysis pipeline includes intelligent model filtering to focus on representative models for "all models" aggregate plots.
```bash
# Default filtering (max 7 priority models)
python3 analysis/run_full_analysis.py

# Custom limit of 5 models
python3 analysis/run_full_analysis.py --max-models-aggregate 5

# Show all models (no filtering)
python3 analysis/run_full_analysis.py --no-model-filtering

# Use custom priority configuration
python3 analysis/run_full_analysis.py --priority-models-config my_models.yaml
```

Automatic Selection: When you have more than 7 models (the default limit), the system automatically selects representative models from different families:
- OpenAI models (GPT-4o-mini, etc.)
- Meta Llama variants (3.1-8B, 3.1-70B, etc.)
- Google models (Gemma, Gemini)
- Qwen models
- Mistral models
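A family-based selection of this kind can be sketched as two passes: first one representative per family in priority order, then fill any remaining slots. The family keywords and `select_representative` function below are illustrative assumptions; the real priorities live in the YAML config:

```python
# Illustrative family priority list (hypothetical; the real list is configured
# in priority_models_config.yaml).
FAMILY_PRIORITY = ["gpt", "llama", "gemma", "qwen", "mistral"]


def select_representative(models, max_models=7):
    """Pick at most max_models, preferring one model per known family first."""
    selected = []
    # First pass: one representative per family, in priority order
    for family in FAMILY_PRIORITY:
        for m in models:
            if family in m.lower() and m not in selected:
                selected.append(m)
                break
    # Second pass: fill remaining slots with whatever is left
    for m in models:
        if len(selected) >= max_models:
            break
        if m not in selected:
            selected.append(m)
    return selected[:max_models]
```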
Priority Configuration: Model priorities are defined in `src/game_reasoning_arena/configs/priority_models_config.yaml`. You can:
- Adjust the maximum number of models for aggregate plots
- Modify the priority model list
- Add new model families
- Configure fallback behavior
Smart Matching: The system uses flexible name matching to handle different model naming conventions:
- `meta-llama/llama-3.1-8b-instruct` matches `llama-3.1-8b-instruct`
- `openai/gpt-4o-mini` matches `gpt-4o-mini`
- Handles variations in separators (`-`, `_`, `/`)
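A normalization pass along these lines would implement the matching; `canonical_name` and `names_match` are hypothetical helpers, not the pipeline's actual functions:

```python
import re


def canonical_name(model: str) -> str:
    """Normalize a model name: drop the provider prefix, unify separators."""
    name = model.split("/")[-1]            # meta-llama/llama-3.1-8b -> llama-3.1-8b
    return re.sub(r"[-_]+", "-", name).lower()


def names_match(a: str, b: str) -> bool:
    """Two names match if they normalize to the same canonical form."""
    return canonical_name(a) == canonical_name(b)
```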
- Cleaner Visualizations: Aggregate plots focus on representative models
- Better Readability: Legends and labels remain manageable
- Preserved Coverage: Individual model analysis remains unaffected
- Configurable: Easy to customize for your specific research needs
Note: Model filtering only affects aggregate plots that show "all models" together. Individual model analysis and specific model comparisons are never filtered.
The primary module for reasoning pattern analysis and categorization.
Key Features:
- Reasoning Categorization: Automatically classifies LLM reasoning into 7 types:
  - Positional: Center control, corner play, spatial strategies
  - Blocking: Defensive moves, preventing opponent wins
  - Opponent Modeling: Predicting opponent behavior
  - Winning Logic: Identifying winning opportunities, creating threats
  - Heuristic: General strategic principles
  - Rule-Based: Following explicit game rules
  - Random/Unjustified: Unclear or random reasoning
- Visualization Generation: Creates multiple plot types:
  - Word clouds of reasoning patterns
  - Pie charts showing reasoning category distributions
  - Heatmaps of move position preferences
  - Statistical summaries of decision-making behavior
Usage:
```python
from reasoning_analysis import LLMReasoningAnalyzer

# Initialize with CSV file from post-game processing
analyzer = LLMReasoningAnalyzer('results/merged_logs_YYYYMMDD_HHMMSS.csv')

# Categorize reasoning patterns
analyzer.categorize_reasoning()

# Generate visualizations
analyzer.compute_metrics(plot_dir='plots')
analyzer.plot_heatmaps_by_agent(output_dir='plots')
analyzer.plot_wordclouds_by_agent(output_dir='plots')
```

Dependencies: pandas, matplotlib, seaborn, wordcloud, transformers, numpy
Comprehensive command-line tool for extracting and viewing reasoning traces from SQLite databases. This tool runs independently and is not part of the automated pipeline, allowing for ad-hoc detailed trace inspection.
Key Features:
- Database Discovery: Automatically finds available database files
- Flexible Filtering: Extract by game type, episode, or custom criteria
- Multiple Output Formats: Text display, CSV export, JSON export
- Pattern Analysis: Built-in statistics and reasoning pattern detection
Usage:
```bash
# List available databases (run without --db argument)
python extract_reasoning_traces.py

# Extract all traces from a specific database
python extract_reasoning_traces.py --db results/llm_litellm_groq_llama3_8b_8192.db

# Filter by game and episode
python extract_reasoning_traces.py --db results/llm_model.db --game tic_tac_toe --episode 1

# Export to CSV for further analysis
python extract_reasoning_traces.py --db results/llm_model.db --export-csv traces.csv

# Export formatted traces to a text file (useful for academic papers)
python extract_reasoning_traces.py --db results/llm_model.db --game tic_tac_toe --export-txt detailed_report.txt

# View analysis only (no detailed traces)
python extract_reasoning_traces.py --db results/llm_model.db --analyze-only
```

Dependencies: sqlite3, pandas, pathlib, argparse
Generates comprehensive reasoning analysis plots including model comparisons, game-specific analysis, and evolution plots.
Key Features:
- Model Name Cleaning: Standardizes model names for clear visualization
- Multiple Plot Types: Bar charts, pie charts, stacked charts, and evolution plots
- Game-Specific Analysis: Individual plots per game per model
- Aggregated Views: Cross-game reasoning pattern analysis
Usage:
```python
from generate_reasoning_plots import plot_reasoning_bar_chart, clean_model_name

# Generate reasoning distribution chart for a model
reasoning_percentages = {'Positional': 30.0, 'Blocking': 25.0, 'Winning Logic': 20.0}
model_name = clean_model_name('llm_litellm_groq_llama3_8b_8192')
plot_reasoning_bar_chart(reasoning_percentages, model_name, 'output_chart.png')
```

Command-line usage:

```bash
python analysis/generate_reasoning_plots.py
```

Dependencies: matplotlib, pandas, pathlib
Merges individual agent SQLite databases into consolidated CSV files for analysis.
Key Features:
- Database Merging: Combines all agent-specific SQLite logs
- Data Validation: Ensures data consistency and completeness
- Summary Statistics: Computes game-level and episode-level metrics
- Timestamped Output: Generates uniquely named merged files
Usage:
```python
from post_game_processing import merge_sqlite_logs, compute_summary_statistics

# Merge all SQLite logs in the results directory
merged_df = merge_sqlite_logs('results/')

# Compute summary statistics
summary_stats = compute_summary_statistics(merged_df)

# Results saved to: results/merged_logs_YYYYMMDD_HHMMSS.csv
```

Dependencies: sqlite3, pandas, pathlib, datetime
The automated pipeline replaces the manual multi-step process below. Instead of running each script individually, you can now:
```bash
./run_analysis.sh           # Interactive analysis with progress tracking
./run_analysis.sh --quiet   # Quiet mode
./run_analysis.sh --full    # Full analysis with all options

# Note: For game/model filtering, use the Python pipeline:
python3 analysis/run_full_analysis.py --game hex --model llama
```

```bash
python3 analysis/run_full_analysis.py [options]

# Filtering options:
python3 analysis/run_full_analysis.py --game hex                 # HEX games only
python3 analysis/run_full_analysis.py --model llama              # Llama models only
python3 analysis/run_full_analysis.py --game hex --model llama   # Combined
```

What the automated pipeline does:
- Database Discovery & Merging: Automatically finds and merges all `.db` files in `results/`
- Reasoning Analysis: Categorizes reasoning patterns using rule-based classification
- Visualization Generation: Creates comprehensive plots, charts, heatmaps, and word clouds
- Error Handling: Continues execution even if individual steps fail
- Progress Tracking: Provides detailed logging and progress updates
- Summary Reporting: Generates JSON reports with pipeline statistics
Note: For detailed reasoning trace extraction, use the standalone tool:

```bash
python analysis/extract_reasoning_traces.py --db results/your_database.db
```

Generated Output:
- `plots/*.png` - All visualization files
- `results/merged_logs_*.csv` - Consolidated data files
- `results/tables/*.csv` - Performance tables and metrics
All execution details are displayed in the console output.
Note: The manual workflow below is still available but not recommended. Use the automated pipeline above for a much better experience.
Run games with LLM models - reasoning traces are automatically collected:
```bash
python scripts/runner.py --config configs/example_config.yaml --override \
  env_config.game_name=tic_tac_toe \
  agents.player_0.type=llm \
  agents.player_0.model=litellm_groq/llama3-8b-8192 \
  num_episodes=10
```

Merge individual databases into analysis-ready format:

```bash
python -c "
from analysis.post_game_processing import merge_sqlite_logs
merge_sqlite_logs('results/')
"
```

View reasoning traces to understand the data:

```bash
python analysis/extract_reasoning_traces.py --db results/llm_model.db --analyze-only
```

Generate full reasoning analysis and visualizations:

```bash
python -c "
from analysis.reasoning_analysis import LLMReasoningAnalyzer
analyzer = LLMReasoningAnalyzer('results/merged_logs_latest.csv')
analyzer.categorize_reasoning()
analyzer.compute_metrics(plot_dir='plots')
analyzer.plot_heatmaps_by_agent(output_dir='plots')
analyzer.plot_wordclouds_by_agent(output_dir='plots')
"
```

Compare reasoning patterns across different models:

```python
from analysis.generate_reasoning_plots import ReasoningPlotGenerator

plotter = ReasoningPlotGenerator('results/merged_logs_latest.csv')
plotter.generate_model_plots('plots/')
```

- Individual agent reasoning traces
- SQLite format for efficient querying
- Contains: game_name, episode, turn, action, reasoning, board_state, timestamp
- Consolidated data from all models
- Ready for statistical analysis
- Timestamped for version control
- `wordcloud_<model>_<game>.png` - Common reasoning terms
- `pie_reasoning_type_<model>_<game>.png` - Reasoning category distributions
- `heatmap_<model>_<game>.png` - Move position preferences
- `reasoning_bar_chart_<model>.png` - Model-specific reasoning breakdowns
- `entropy_by_turn_all_models_<game>.png` - Reasoning diversity over time
- `reasoning_evolution_<model>_<game>.png` - How reasoning patterns change during games
Short games (like Kuhn Poker, Matching Pennies, and Prisoner's Dilemma) may have limited or empty entropy/evolution visualizations due to:
- Few turns per game: Games lasting only 1-2 turns provide insufficient data for meaningful entropy calculations
- Limited reasoning diversity: With only 1-2 reasoning entries per agent per turn, entropy values are often zero
- No evolution patterns: Reasoning evolution requires multiple turns to show meaningful progression
- Sparse data: Individual models may have too few data points for statistical analysis
Recommendation: Focus on longer games (Tic-Tac-Toe, Connect Four) for entropy and evolution analysis. Short games are better suited for reasoning category distribution analysis (pie charts, bar charts).
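The zero-entropy effect on short games is easy to see with a small Shannon-entropy calculation (a sketch; the pipeline's own implementation may differ):

```python
import math
from collections import Counter


def reasoning_entropy(categories):
    """Shannon entropy (in bits) of the reasoning-category distribution."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

With a single reasoning entry per turn (as in 1-2 turn games) the distribution is degenerate and the entropy is always 0 bits, which is why the entropy plots come out empty or flat.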
- Compare reasoning sophistication across different LLMs
- Identify model-specific strategic preferences
- Evaluate reasoning consistency within models
- Understand how LLMs approach different game types
- Identify common strategic patterns and misconceptions
- Analyze adaptation to different opponents
- Categorize and quantify reasoning types
- Identify gaps in strategic thinking
- Evaluate decision-making consistency
- Link reasoning quality to game outcomes
- Identify which reasoning types lead to better performance
- Study the relationship between reasoning length and quality
Modify `REASONING_RULES` in `reasoning_analysis.py`:

```python
REASONING_RULES = {
    "Custom_Category": [
        re.compile(r"\bcustom_pattern\b"),
        re.compile(r"\banother_pattern\b")
    ],
    # ... existing categories
}
```

Modify plotting functions in `generate_reasoning_plots.py` for custom styling, colors, and layouts.
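Applying the rule table might look like the first-match sketch below. The two-category table and the first-match strategy are illustrative assumptions; the matching logic in `reasoning_analysis.py` may differ:

```python
import re

# Minimal two-category rule table for illustration (the real table in
# reasoning_analysis.py is larger).
REASONING_RULES = {
    "Blocking": [re.compile(r"\bblock\b"), re.compile(r"\bprevent\b")],
    "Positional": [re.compile(r"\bcenter\b"), re.compile(r"\bcorner\b")],
}


def categorize(text):
    """Return the first category whose patterns match, else the fallback."""
    lowered = text.lower()
    for category, patterns in REASONING_RULES.items():
        if any(p.search(lowered) for p in patterns):
            return category
    return "Random/Unjustified"
```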
The SQLite schema can be extended by modifying the logging functions in the main arena codebase.
Core requirements for the analysis module:
```bash
pip install pandas matplotlib seaborn wordcloud transformers numpy
```

- Short games (Kuhn Poker, Matching Pennies, Prisoner's Dilemma) naturally produce sparse entropy data
- Games with only 1-2 turns cannot show meaningful reasoning evolution
- Consider focusing analysis on longer games (Tic-Tac-Toe, Connect Four) for temporal analysis
- Use reasoning category distribution plots (pie/bar charts) for short games instead
- Process data in chunks using the pandas `chunksize` parameter
- Filter data by game type or time period before analysis
- Use `--game` and `--model` filters to analyze specific subsets
- Use SQLite queries to pre-filter before loading into memory
- Use `--game hex` to analyze only HEX games for faster processing
- Use `--model llama` to compare only Llama model variants
- Combine filters: `--game hex --model llama` for targeted research questions