English · 中文
Overview · Sample Schema · Game Arena · Support CLI · Contributing · AGENTS
GAGE is a unified, extensible evaluation framework for large language models, multimodal (omni, robot) models, audio models, and diffusion models. Built as a high-performance engine for fast execution, scalability, and flexibility, it covers AI model evaluation, agent-based benchmarking, and game-arena evaluation under a single framework.
- 🚀 Fastest Evaluation Engine: Built for speed. GAGE fully utilizes GPU and CPU resources to run evaluations as fast as possible, scaling smoothly from single-machine testing to million-sample, multi-cluster runs.
- 🔗 All-in-one Evaluation Interface: Evaluate any dataset × any model with minimal glue code. GAGE provides a unified abstraction over datasets, models, metrics, and runtimes, allowing new benchmarks or model backends to be onboarded in minutes.
- 🔌 Extensible (Game & Agent) Sandbox: Natively supports game-based evaluation, agent environments, GUI interaction sandboxes, and tool-augmented tasks. All environments run under the same evaluation engine, making it easy to benchmark LLMs, multimodal models, and agents in a unified way.
- 🧩 Inheritance-Driven Extensibility: Easily extend existing benchmarks by inheriting and overriding only what you need. Add new datasets, metrics, or evaluation logic without touching the core framework or rewriting boilerplate code (see the sketch after this list).
- 📡 Enterprise Observability: More than logs. GAGE provides real-time metrics and visibility into each evaluation stage, making it easy to monitor runs and quickly identify performance bottlenecks or failures.
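As a rough illustration of the inheritance-driven style, the sketch below subclasses an existing benchmark and overrides only its metric. It is a minimal, self-contained example: `MultipleChoiceBenchmark`, `LenientPIQABenchmark`, `load_samples`, and `score` are invented names standing in for GAGE's actual classes and hooks, not its real API.

```python
# Hedged sketch: the base class is a stand-in for whatever benchmark class
# GAGE ships; only the subclass-and-override pattern is the point.
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    target: str
    prediction: str = ""


class MultipleChoiceBenchmark:
    """Stand-in for an existing benchmark class (hypothetical name)."""

    def load_samples(self):
        # A real benchmark would read its dataset split here.
        return [Sample(prompt="Q: ...", target="A")]

    def score(self, sample: Sample) -> float:
        # Default metric: strict exact match.
        return float(sample.prediction.strip() == sample.target)


class LenientPIQABenchmark(MultipleChoiceBenchmark):
    """Extends the existing benchmark, overriding only the metric."""

    def score(self, sample: Sample) -> float:
        # Case-insensitive match; loading and everything else is inherited.
        return float(sample.prediction.strip().upper() == sample.target.upper())


if __name__ == "__main__":
    bench = LenientPIQABenchmark()
    sample = bench.load_samples()[0]
    sample.prediction = " a "
    print(bench.score(sample))  # 1.0 under the lenient metric, 0.0 under the default
```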
Core Design Philosophy: Everything is a Step, everything is configurable.
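To make the step-based philosophy concrete, here is a hedged sketch of what a custom step might look like. The `Step` base class, `register_step` decorator, and `STEP_REGISTRY` are stand-ins invented for illustration; GAGE's real interfaces may differ, and the toy step here mirrors the Echo demo rather than any shipped component.

```python
# Hedged sketch only: `Step` and the registry below are stand-ins, not GAGE's
# actual classes. The point is the shape: each pipeline stage is a Step that
# receives a context dict and returns an updated context dict.
from typing import Callable, Dict

STEP_REGISTRY: Dict[str, Callable[..., "Step"]] = {}


def register_step(name: str):
    """Toy registry decorator standing in for however GAGE registers steps."""
    def wrap(cls):
        STEP_REGISTRY[name] = cls
        return cls
    return wrap


class Step:
    """Stand-in base class: one configurable stage of an evaluation run."""

    def __init__(self, **config):
        self.config = config  # everything a step needs arrives via config

    def run(self, context: dict) -> dict:
        raise NotImplementedError


@register_step("uppercase_echo")
class UppercaseEchoStep(Step):
    """Toy 'inference' step that just echoes the prompt in upper case."""

    def run(self, context: dict) -> dict:
        context["response"] = context["prompt"].upper()
        return context


if __name__ == "__main__":
    # A run config (e.g. a YAML file under config/run_configs/) would list the
    # steps to execute by name; here we instantiate one directly.
    step = STEP_REGISTRY["uppercase_echo"]()
    print(step.run({"prompt": "hello gage"}))
```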
```bash
# If you're in a mono-repo root, run: cd gage-eval-main
# Python 3.10+ recommended
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

```bash
# Run Echo demo (No GPU required, uses Dummy Backend)
python run.py \
  --config config/run_configs/demo_echo_run_1.yaml \
  --output-dir runs \
  --run-id demo_echo
```

Default output structure:
```text
runs/<run_id>/
  events.jsonl    # Detailed event logs
  samples.jsonl   # Samples with inputs and outputs
  summary.json    # Final score summary
```
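Once a run finishes, these files can be inspected with a few lines of standard-library Python. The snippet below only assumes the file names listed above; the `runs/demo_echo` path comes from the demo command, and the JSON fields inside `summary.json` and `samples.jsonl` are deliberately not assumed.

```python
import json
from pathlib import Path

# Inspect a finished run's artifacts; only the file names above come from the
# docs, the JSON fields inside them are not assumed.
run_dir = Path("runs/demo_echo")

# Final score summary (a single JSON document).
summary = json.loads((run_dir / "summary.json").read_text())
print("summary:", summary)

# Per-sample inputs and outputs (JSON Lines: one JSON object per line).
with (run_dir / "samples.jsonl").open() as f:
    for line in f:
        sample = json.loads(line)
        print("first sample keys:", sorted(sample))
        break  # just peek at the first record
```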
| Scenario | Config Example | Description |
|---|---|---|
| Basic QA | `config/custom/piqa_qwen3.yaml` | Text multiple-choice (PIQA) |
| LLM Judge | `config/custom/single_task_local_judge_qwen.yaml` | Use local LLM for grading |
| Game Arena | `config/custom/gomoku_human_vs_llm.yaml` | Gomoku Human vs LLM match |
| Code Gen | `config/custom/swebench_pro_smoke.yaml` | SWE-bench (requires Docker, experimental) |
- 🤖 Agent Evaluation: Add native agent benchmarking support with tool-use traces, trajectory scoring, and safety checks.
- 🎮 GameArena Expansion: Grow the game catalog and add richer rulesets, schedulers, and evaluation metrics.
- 🛠️ Gage-Client: A dedicated client tool focused on streamlined configuration management, failure diagnostics, and benchmark onboarding.
- 🌐 Distributed Inference: Introduce a `RoleType Controller` architecture to support multi-node task sharding and load balancing for massive runs.
- 🚀 Benchmark Expansion: Continuous growth of the evaluation suite across diverse domains with out-of-the-box configs and guidance.
This project is in internal validation; APIs, configs, and docs may change rapidly.


