This directory contains tools for analyzing LLM reasoning patterns and game performance data collected from the Game Reasoning Arena experiments.
Use these automated solutions:
```bash
# From the project root directory:
./run_analysis.sh
```

```bash
# From the project root directory:
python3 analysis/quick_analysis.py
```

```bash
# From the project root directory:
PYTHONPATH=. python3 analysis/run_full_analysis.py --help
```
```bash
# Examples:
python3 analysis/run_full_analysis.py                            # Default settings
python3 analysis/run_full_analysis.py --quiet                    # Less verbose
python3 analysis/run_full_analysis.py --plots-dir custom_plots   # Custom output

# Game-specific and model-specific analysis
python3 analysis/run_full_analysis.py --game hex                 # Analyze only HEX games
python3 analysis/run_full_analysis.py --model llama3             # Analyze only Llama3 models
python3 analysis/run_full_analysis.py --game hex --model llama3  # Combined filtering
python3 analysis/run_full_analysis.py --game tic_tac_toe --quiet # Quiet Tic-Tac-Toe analysis
```

These automated solutions will:
- 🔍 Auto-discover all SQLite databases in `results/`
- 🔄 Merge databases into consolidated CSV files
- 🎯 Apply filters (optional) for specific games or models
- 🧠 Analyze reasoning patterns using rule-based categorization
- 📊 Generate visualizations (plots, charts, heatmaps, word clouds)
- 📋 Create summary reports with pipeline statistics
- ⚡ Handle errors gracefully with detailed logging
Output: All results are saved to the `plots/` directory, along with detailed logs.
The analysis pipeline now supports filtering for specific games and models, allowing you to focus your analysis on particular scenarios.
```bash
# Analyze only HEX games
python3 analysis/run_full_analysis.py --game hex

# Analyze only Tic-Tac-Toe games
python3 analysis/run_full_analysis.py --game tic_tac_toe

# Analyze only Connect Four games
python3 analysis/run_full_analysis.py --game connect_four
```

```bash
# Analyze only Llama models (partial matching)
python3 analysis/run_full_analysis.py --model llama

# Analyze only GPT models
python3 analysis/run_full_analysis.py --model gpt

# Analyze a specific model variant
python3 analysis/run_full_analysis.py --model llama3-8b
```

```bash
# Analyze HEX games played by Llama models only
python3 analysis/run_full_analysis.py --game hex --model llama

# Analyze Tic-Tac-Toe games with GPT models in quiet mode
python3 analysis/run_full_analysis.py --game tic_tac_toe --model gpt --quiet
```

When filters are applied, plots are automatically organized in subdirectories:
- `plots/game_hex/` - HEX-specific analysis
- `plots/model_llama/` - Llama model-specific analysis
- `plots/game_hex_model_llama/` - Combined filtering results
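The subdirectory names follow a simple pattern built from the active filters. A hypothetical reconstruction (`plots_subdir` is not the pipeline's actual function):

```python
def plots_subdir(game=None, model=None):
    """Build the plots subdirectory path for the active game/model filters."""
    parts = []
    if game:
        parts.append(f"game_{game}")
    if model:
        parts.append(f"model_{model}")
    # With no filters, results go directly into plots/
    return "plots/" + "_".join(parts) + "/" if parts else "plots/"
```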
This makes it easy to focus on specific research questions without processing all data.
When analyzing data from many models, aggregate plots can become cluttered and hard to interpret. The analysis pipeline includes intelligent model filtering to focus on representative models for "all models" aggregate plots.
```bash
# Default filtering (max 7 priority models)
python3 analysis/run_full_analysis.py

# Custom limit of 5 models
python3 analysis/run_full_analysis.py --max-models-aggregate 5

# Show all models (no filtering)
python3 analysis/run_full_analysis.py --no-model-filtering

# Use custom priority configuration
python3 analysis/run_full_analysis.py --priority-models-config my_models.yaml
```

Automatic Selection: When you have more than 7 models (the default limit), the system automatically selects representative models from different families:
- OpenAI models (GPT-4o-mini, etc.)
- Meta Llama variants (3.1-8B, 3.1-70B, etc.)
- Google models (Gemma, Gemini)
- Qwen models
- Mistral models
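A family-based selection of this kind can be sketched as two passes: first one representative per family in priority order, then fill any remaining slots. The family keywords and `select_representative` function below are illustrative assumptions; the real priorities live in the YAML config:

```python
# Illustrative family priority list (hypothetical; the real list is configured
# in priority_models_config.yaml).
FAMILY_PRIORITY = ["gpt", "llama", "gemma", "qwen", "mistral"]


def select_representative(models, max_models=7):
    """Pick at most max_models, preferring one model per known family first."""
    selected = []
    # First pass: one representative per family, in priority order
    for family in FAMILY_PRIORITY:
        for m in models:
            if family in m.lower() and m not in selected:
                selected.append(m)
                break
    # Second pass: fill remaining slots with whatever is left
    for m in models:
        if len(selected) >= max_models:
            break
        if m not in selected:
            selected.append(m)
    return selected[:max_models]
```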
Priority Configuration: Model priorities are defined in `src/game_reasoning_arena/configs/priority_models_config.yaml`. You can:
- Adjust the maximum number of models for aggregate plots
- Modify the priority model list
- Add new model families
- Configure fallback behavior
Smart Matching: The system uses flexible name matching to handle different model naming conventions:
- `meta-llama/llama-3.1-8b-instruct` matches `llama-3.1-8b-instruct`
- `openai/gpt-4o-mini` matches `gpt-4o-mini`
- Handles variations in separators (`-`, `_`, `/`)
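A normalization pass along these lines would implement the matching; `canonical_name` and `names_match` are hypothetical helpers, not the pipeline's actual functions:

```python
import re


def canonical_name(model: str) -> str:
    """Normalize a model name: drop the provider prefix, unify separators."""
    name = model.split("/")[-1]            # meta-llama/llama-3.1-8b -> llama-3.1-8b
    return re.sub(r"[-_]+", "-", name).lower()


def names_match(a: str, b: str) -> bool:
    """Two names match if they normalize to the same canonical form."""
    return canonical_name(a) == canonical_name(b)
```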
- Cleaner Visualizations: Aggregate plots focus on representative models
- Better Readability: Legends and labels remain manageable
- Preserved Coverage: Individual model analysis remains unaffected
- Configurable: Easy to customize for your specific research needs
Note: Model filtering only affects aggregate plots that show "all models" together. Individual model analysis and specific model comparisons are never filtered.
The primary module for reasoning pattern analysis and categorization.
Key Features:
- Reasoning Categorization: Automatically classifies LLM reasoning into 7 types:
  - Positional: Center control, corner play, spatial strategies
  - Blocking: Defensive moves, preventing opponent wins
  - Opponent Modeling: Predicting opponent behavior
  - Winning Logic: Identifying winning opportunities, creating threats
  - Heuristic: General strategic principles
  - Rule-Based: Following explicit game rules
  - Random/Unjustified: Unclear or random reasoning
- Visualization Generation: Creates multiple plot types:
  - Word clouds of reasoning patterns
  - Pie charts showing reasoning category distributions
  - Heatmaps of move position preferences
  - Statistical summaries of decision-making behavior
Usage:
```python
from reasoning_analysis import LLMReasoningAnalyzer

# Initialize with CSV file from post-game processing
analyzer = LLMReasoningAnalyzer('results/merged_logs_YYYYMMDD_HHMMSS.csv')

# Categorize reasoning patterns
analyzer.categorize_reasoning()

# Generate visualizations
analyzer.compute_metrics(plot_dir='plots')
analyzer.plot_heatmaps_by_agent(output_dir='plots')
analyzer.plot_wordclouds_by_agent(output_dir='plots')
```

Dependencies: pandas, matplotlib, seaborn, wordcloud, transformers, numpy
Comprehensive command-line tool for extracting and viewing reasoning traces from SQLite databases. This tool runs independently and is not part of the automated pipeline, allowing for ad-hoc detailed trace inspection.
Key Features:
- Database Discovery: Automatically finds available database files
- Flexible Filtering: Extract by game type, episode, or custom criteria
- Multiple Output Formats: Text display, CSV export, JSON export
- Pattern Analysis: Built-in statistics and reasoning pattern detection
Usage:
```bash
# List available databases (run without --db argument)
python extract_reasoning_traces.py

# Extract all traces from a specific database
python extract_reasoning_traces.py --db results/llm_litellm_groq_llama3_8b_8192.db

# Filter by game and episode
python extract_reasoning_traces.py --db results/llm_model.db --game tic_tac_toe --episode 1

# Export to CSV for further analysis
python extract_reasoning_traces.py --db results/llm_model.db --export-csv traces.csv

# Export formatted traces to a text file (useful for academic papers)
python extract_reasoning_traces.py --db results/llm_model.db --game tic_tac_toe --export-txt detailed_report.txt

# View analysis only (no detailed traces)
python extract_reasoning_traces.py --db results/llm_model.db --analyze-only
```

Dependencies: sqlite3, pandas, pathlib, argparse
Generates comprehensive reasoning analysis plots including model comparisons, game-specific analysis, and evolution plots.
Key Features:
- Model Name Cleaning: Standardizes model names for clear visualization
- Multiple Plot Types: Bar charts, pie charts, stacked charts, and evolution plots
- Game-Specific Analysis: Individual plots per game per model
- Aggregated Views: Cross-game reasoning pattern analysis
Usage:
```python
from generate_reasoning_plots import plot_reasoning_bar_chart, clean_model_name

# Generate reasoning distribution chart for a model
reasoning_percentages = {'Positional': 30.0, 'Blocking': 25.0, 'Winning Logic': 20.0}
model_name = clean_model_name('llm_litellm_groq_llama3_8b_8192')
plot_reasoning_bar_chart(reasoning_percentages, model_name, 'output_chart.png')
```

Command-line usage:

```bash
python analysis/generate_reasoning_plots.py
```

Dependencies: matplotlib, pandas, pathlib
Merges individual agent SQLite databases into consolidated CSV files for analysis.
Key Features:
- Database Merging: Combines all agent-specific SQLite logs
- Data Validation: Ensures data consistency and completeness
- Summary Statistics: Computes game-level and episode-level metrics
- Timestamped Output: Generates uniquely named merged files
Usage:
```python
from post_game_processing import merge_sqlite_logs, compute_summary_statistics

# Merge all SQLite logs in the results directory
merged_df = merge_sqlite_logs('results/')

# Compute summary statistics
summary_stats = compute_summary_statistics(merged_df)

# Results saved to: results/merged_logs_YYYYMMDD_HHMMSS.csv
```

Dependencies: sqlite3, pandas, pathlib, datetime
The automated pipeline replaces the manual multi-step process below. Instead of running each script individually, you can now:
```bash
./run_analysis.sh           # Interactive analysis with progress tracking
./run_analysis.sh --quiet   # Quiet mode
./run_analysis.sh --full    # Full analysis with all options

# Note: For game/model filtering, use the Python pipeline:
python3 analysis/run_full_analysis.py --game hex --model llama
```

```bash
python3 analysis/run_full_analysis.py [options]

# Filtering options:
python3 analysis/run_full_analysis.py --game hex                 # HEX games only
python3 analysis/run_full_analysis.py --model llama              # Llama models only
python3 analysis/run_full_analysis.py --game hex --model llama   # Combined
```

What the automated pipeline does:
- Database Discovery & Merging: Automatically finds and merges all `.db` files in `results/`
- Reasoning Analysis: Categorizes reasoning patterns using rule-based classification
- Visualization Generation: Creates comprehensive plots, charts, heatmaps, and word clouds
- Error Handling: Continues execution even if individual steps fail
- Progress Tracking: Provides detailed logging and progress updates
- Summary Reporting: Generates JSON reports with pipeline statistics
Note: For detailed reasoning trace extraction, use the standalone tool:

```bash
python analysis/extract_reasoning_traces.py --db results/your_database.db
```

Generated Output:
- `plots/*.png` - All visualization files
- `results/merged_logs_*.csv` - Consolidated data files
- `results/tables/*.csv` - Performance tables and metrics
All execution details are displayed in the console output.
Note: The manual workflow below is still available but not recommended. Use the automated pipeline above for a much better experience.
Run games with LLM models - reasoning traces are automatically collected:
```bash
python scripts/runner.py --config configs/example_config.yaml --override \
  env_config.game_name=tic_tac_toe \
  agents.player_0.type=llm \
  agents.player_0.model=litellm_groq/llama3-8b-8192 \
  num_episodes=10
```

Merge individual databases into analysis-ready format:

```bash
python -c "
from analysis.post_game_processing import merge_sqlite_logs
merge_sqlite_logs('results/')
"
```

View reasoning traces to understand the data:

```bash
python analysis/extract_reasoning_traces.py --db results/llm_model.db --analyze-only
```

Generate full reasoning analysis and visualizations:

```bash
python -c "
from analysis.reasoning_analysis import LLMReasoningAnalyzer
analyzer = LLMReasoningAnalyzer('results/merged_logs_latest.csv')
analyzer.categorize_reasoning()
analyzer.compute_metrics(plot_dir='plots')
analyzer.plot_heatmaps_by_agent(output_dir='plots')
analyzer.plot_wordclouds_by_agent(output_dir='plots')
"
```

Compare reasoning patterns across different models:

```python
from analysis.generate_reasoning_plots import ReasoningPlotGenerator

plotter = ReasoningPlotGenerator('results/merged_logs_latest.csv')
plotter.generate_model_plots('plots/')
```

- Individual agent reasoning traces
- SQLite format for efficient querying
- Contains: game_name, episode, turn, action, reasoning, board_state, timestamp
- Consolidated data from all models
- Ready for statistical analysis
- Timestamped for version control
- `wordcloud_<model>_<game>.png` - Common reasoning terms
- `pie_reasoning_type_<model>_<game>.png` - Reasoning category distributions
- `heatmap_<model>_<game>.png` - Move position preferences
- `reasoning_bar_chart_<model>.png` - Model-specific reasoning breakdowns
- `entropy_by_turn_all_models_<game>.png` - Reasoning diversity over time
- `reasoning_evolution_<model>_<game>.png` - How reasoning patterns change during games
Short games (like Kuhn Poker, Matching Pennies, and Prisoner's Dilemma) may have limited or empty entropy/evolution visualizations due to:
- Few turns per game: Games lasting only 1-2 turns provide insufficient data for meaningful entropy calculations
- Limited reasoning diversity: With only 1-2 reasoning entries per agent per turn, entropy values are often zero
- No evolution patterns: Reasoning evolution requires multiple turns to show meaningful progression
- Sparse data: Individual models may have too few data points for statistical analysis
Recommendation: Focus on longer games (Tic-Tac-Toe, Connect Four) for entropy and evolution analysis. Short games are better suited for reasoning category distribution analysis (pie charts, bar charts).
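The zero-entropy effect on short games is easy to see with a small Shannon-entropy calculation (a sketch; the pipeline's own implementation may differ):

```python
import math
from collections import Counter


def reasoning_entropy(categories):
    """Shannon entropy (in bits) of the reasoning-category distribution."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

With a single reasoning entry per turn (as in 1-2 turn games) the distribution is degenerate and the entropy is always 0 bits, which is why the entropy plots come out empty or flat.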
- Compare reasoning sophistication across different LLMs
- Identify model-specific strategic preferences
- Evaluate reasoning consistency within models
- Understand how LLMs approach different game types
- Identify common strategic patterns and misconceptions
- Analyze adaptation to different opponents
- Categorize and quantify reasoning types
- Identify gaps in strategic thinking
- Evaluate decision-making consistency
- Link reasoning quality to game outcomes
- Identify which reasoning types lead to better performance
- Study the relationship between reasoning length and quality
Modify `REASONING_RULES` in `reasoning_analysis.py`:

```python
REASONING_RULES = {
    "Custom_Category": [
        re.compile(r"\bcustom_pattern\b"),
        re.compile(r"\banother_pattern\b")
    ],
    # ... existing categories
}
```

Modify plotting functions in `generate_reasoning_plots.py` for custom styling, colors, and layouts.
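Applying the rule table might look like the first-match sketch below. The two-category table and the first-match strategy are illustrative assumptions; the matching logic in `reasoning_analysis.py` may differ:

```python
import re

# Minimal two-category rule table for illustration (the real table in
# reasoning_analysis.py is larger).
REASONING_RULES = {
    "Blocking": [re.compile(r"\bblock\b"), re.compile(r"\bprevent\b")],
    "Positional": [re.compile(r"\bcenter\b"), re.compile(r"\bcorner\b")],
}


def categorize(text):
    """Return the first category whose patterns match, else the fallback."""
    lowered = text.lower()
    for category, patterns in REASONING_RULES.items():
        if any(p.search(lowered) for p in patterns):
            return category
    return "Random/Unjustified"
```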
The SQLite schema can be extended by modifying the logging functions in the main arena codebase.
Core requirements for the analysis module:
```bash
pip install pandas matplotlib seaborn wordcloud transformers numpy
```

- Short games (Kuhn Poker, Matching Pennies, Prisoner's Dilemma) naturally produce sparse entropy data
- Games with only 1-2 turns cannot show meaningful reasoning evolution
- Consider focusing analysis on longer games (Tic-Tac-Toe, Connect Four) for temporal analysis
- Use reasoning category distribution plots (pie/bar charts) for short games instead
- Process data in chunks using the pandas `chunksize` parameter
- Filter data by game type or time period before analysis
- Use `--game` and `--model` filters to analyze specific subsets
- Use SQLite queries to pre-filter before loading into memory
- Use `--game hex` to analyze only HEX games for faster processing
- Use `--model llama` to compare only Llama model variants
- Combine filters: `--game hex --model llama` for targeted research questions