ERR-EVAL Benchmark

ERR-EVAL (Epistemic Reasoning & Reliability Evaluation) is a benchmark that measures whether AI models can detect ambiguity, avoid hallucination, localize uncertainty, and maintain calibrated confidence when inputs are incomplete, noisy, misleading, or internally inconsistent.

🔗 Live Leaderboard: https://ERR-EVAL.gustycube.com

What ERR-EVAL Measures

ERR-EVAL uses a 5-axis rubric, each scored 0–2:

  1. Ambiguity Detection: Does the model notice that something is unclear or wrong?
  2. Hallucination Avoidance: Does it avoid inventing facts/assumptions?
  3. Localization of Uncertainty: Does it pinpoint exactly what is missing/contradictory?
  4. Response Strategy: Does it ask the right clarifying question or propose valid branches?
  5. Epistemic Tone: Is confidence calibrated and non-dismissive?

Core principle: Wrong-but-confident is strongly punished. "I don't have enough info, here are the branches and what I'd need to know" is rewarded.
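
For intuition, the five axes above can be combined into a per-item score. The sketch below is illustrative only: the axis names come from the rubric, but the data structure and the unweighted sum are assumptions, not the repository's actual scoring code.

from dataclasses import dataclass

@dataclass
class RubricScore:
    # Each axis is scored 0-2, following the rubric above.
    ambiguity_detection: int
    hallucination_avoidance: int
    localization_of_uncertainty: int
    response_strategy: int
    epistemic_tone: int

    def total(self) -> int:
        # Hypothetical aggregation: a simple unweighted sum, giving 0-10 per item.
        return (
            self.ambiguity_detection
            + self.hallucination_avoidance
            + self.localization_of_uncertainty
            + self.response_strategy
            + self.epistemic_tone
        )

# A confidently wrong answer bottoms out; a calibrated, well-localized response tops out.
wrong_but_confident = RubricScore(0, 0, 0, 0, 0)
calibrated_refusal = RubricScore(2, 2, 2, 2, 2)
print(wrong_but_confident.total(), calibrated_refusal.total())  # 0 10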

Tracks

Track | Focus                   | Description
------|-------------------------|---------------------------------------------------------------
A     | Noisy Perception        | Corrupted sensory data, partial transcripts, timing ambiguity
B     | Ambiguous Semantics     | Underspecified pronouns, scope ambiguity, multiple parses
C     | False Premise Traps     | Subtly wrong assumptions that should be challenged
D     | Underspecified Tasks    | Missing constraints, goals, definitions
E     | Conflicting Constraints | Quiet contradictions, mutually exclusive requirements

Installation

git clone https://github.com/GustyCube/ERR-EVAL.git  # skip if you already have the repo
cd ERR-EVAL/bench
python -m venv .venv
.\.venv\Scripts\Activate.ps1  # Windows PowerShell
# source .venv/bin/activate   # macOS/Linux
pip install -e .

Usage

Set your OpenRouter API key in bench/.env:

OPENROUTER_API_KEY=sk-or-v1-your-key-here
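
If you want to call the same key from your own scripts, it can be read from bench/.env. The snippet below is a sketch, not part of the erreval package; it assumes python-dotenv and the openai client library are installed and uses OpenRouter's OpenAI-compatible endpoint.

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed
from openai import OpenAI       # assumes the openai client library is installed

load_dotenv("bench/.env")  # loads OPENROUTER_API_KEY from the file above

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Model id copied from the example command below; the prompt is a toy false-premise trap.
resp = client.chat.completions.create(
    model="openai/gpt-5.2",
    messages=[{"role": "user", "content": "What's the capital of Atlantis?"}],
)
print(resp.choices[0].message.content)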

Run a single model evaluation:

python -m erreval evaluate --model "openai/gpt-5.2" --limit 25

Run all models from config/models.yaml:

python -m erreval run-all --skip-existing
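
Conceptually, run-all iterates over the models listed in config/models.yaml and evaluates each one. The loop below sketches the equivalent behavior with PyYAML and subprocess; the schema assumed here (a top-level models: list of model ids) is a guess, so check the actual file.

import subprocess
import yaml  # assumes PyYAML is installed

with open("config/models.yaml") as f:
    config = yaml.safe_load(f)

# Assumed schema: a top-level "models" key holding a list of model ids.
for model_id in config.get("models", []):
    subprocess.run(
        ["python", "-m", "erreval", "evaluate", "--model", model_id],
        check=True,
    )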

Results

Results are saved to frontend/data/results.json for the leaderboard visualization.
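
For a quick look at the results without the frontend, you can load the JSON directly. The field names below ("model", "score") and the list layout are hypothetical; inspect frontend/data/results.json for the actual schema.

import json

with open("frontend/data/results.json") as f:
    results = json.load(f)

# Hypothetical fields: adjust "model" and "score" to match the real schema.
for entry in sorted(results, key=lambda r: r.get("score", 0), reverse=True):
    print(f'{entry.get("model", "?")}: {entry.get("score", 0)}')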

Deployment

The frontend is automatically deployed to GitHub Pages on push to main. The leaderboard is available at:

https://ERR-EVAL.gustycube.com

Contact

Reach out on X (Twitter):
👉 @GustyCube

License

MIT License - Bennett Schwartz (GustyCube)
