eval-guard

How do you know your AI isn't getting worse at reading your documents?

What this is

eval-guard is a lightweight Python library that runs your LLM through four regression tests and stores the scores in a local SQLite database. Drop it into an agent loop, a research pipeline, or a CI job. Over repeated runs, the database builds a time series of model quality, so you can spot drift before it reaches production.

You can use it as a library inside your own code, or run it from the command line when you want a quick snapshot.

What it measures

Factuality - Checks whether the model gives correct factual answers to straightforward questions about geography, science, and math.
Consistency - Measures whether the model gives similar answers when the same question is asked in several different ways.
Context Adherence - Tests whether the model admits it does not know the answer when the information is not present in the provided text, rather than making something up.
Latency - Measures how fast the model generates tokens and how much GPU memory it uses. This is an infrastructure health metric, not a quality metric.
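
To make the Consistency idea concrete, here is a minimal sketch of one way to score agreement between answers to rephrased questions. It uses mean pairwise Jaccard similarity over word sets, which is an assumption chosen for illustration and not necessarily the metric eval-guard implements. The function names are hypothetical.

```python
from itertools import combinations


def token_set(text: str) -> set[str]:
    """Lowercase the text and split it on whitespace into a set of tokens."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two answers' token sets (1.0 means identical sets)."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0  # two empty answers agree trivially
    return len(ta & tb) / len(ta | tb)


def consistency_score(answers: list[str]) -> float:
    """Mean pairwise Jaccard similarity across all answers (hypothetical metric)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # a single answer cannot disagree with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A word-overlap measure like this is cheap and deterministic, but it penalizes paraphrases that mean the same thing; an embedding-based similarity would be more forgiving at the cost of an extra model call.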

Drop it into your agent loop

from eval_guard.runner import run_all
from eval_guard.ollama_client import OllamaClient
from eval_guard.results_db import init_db, get_trend

client = OllamaClient(base_url="http://localhost:11434")

# Run every benchmark, write results, and get a structured report
report = run_all(client, model="llama3.1:8b", db_path="eval_guard.db")
print(report["results"]["factuality"]["score"])  # e.g. 0.91

# Query the last 10 runs to see if quality is drifting
conn = init_db("eval_guard.db")
trend = get_trend(conn, benchmark="factuality", model="llama3.1:8b", n=10)
# trend == [{"run_at": "2026-06-07T14:00:00", "score": 0.91}, ...]
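
Once you have a trend list in the shape shown above, a simple drift check might look like the sketch below. It compares the newest score against the mean of the earlier ones. The function name detect_drift and the 0.05 tolerance are assumptions for illustration and are not part of eval-guard.

```python
from statistics import mean


def detect_drift(trend: list[dict], tolerance: float = 0.05) -> bool:
    """Return True if the newest score sits more than `tolerance` below
    the mean of all earlier scores. Hypothetical helper, not library API."""
    if len(trend) < 2:
        return False  # not enough history to compare against
    # ISO 8601 timestamps sort chronologically as plain strings, so this
    # works whichever order the rows arrive in.
    ordered = sorted(trend, key=lambda row: row["run_at"])
    *earlier, latest = ordered
    baseline = mean(row["score"] for row in earlier)
    return baseline - latest["score"] > tolerance
```

Comparing against a mean rather than only the previous run keeps a single noisy result from hiding a real drop, though it also reacts more slowly to gradual decline.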

You can also fail programmatically when a score drops below its configured threshold:

# Reuses `client` from the example above; send_alert is a function you supply.
report = run_all(client, model="llama3.1:8b", db_path="eval_guard.db", fail_below=True)
if "failures" in report:
    send_alert(report["failures"])  # your alerting hook

Use it from the command line

Install eval-guard and its dependencies:

pip install -e .

Run all four benchmarks:

python run_eval.py --model llama3.1:8b --benchmark all

Fail the process if any score is below the acceptable threshold:

python run_eval.py --model llama3.1:8b --benchmark all --fail-below

Launch the Streamlit dashboard to see results over time, filter by model and date, refresh with one click, or download the data as CSV:

streamlit run run_dashboard.py

What the scores mean

Benchmark	Good score	Bad score	What a bad score means for your business
Factuality	1.0	Below 0.8	Your model is giving wrong answers. Clients lose trust.
Consistency	Above 0.5	Below 0.3	Your model gives different answers to the same question depending on how you ask. Confusing and unreliable.
Context Adherence	Above 0.8	Below 0.2	Your model invents facts that were never in your documents. This is how hallucinations become client-facing.
Latency	Above 0.8	Below 0.3	Your model is responding slowly. Clients are waiting.
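
The bands in the table can be applied mechanically. The sketch below labels a score as good, borderline, or bad using the thresholds listed above. The key names other than "factuality", the Band class, and the "borderline" label for in-between scores are assumptions made for this example; eval-guard reads its real thresholds from config.yaml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    good: float                   # lower edge of the "good" range
    bad: float                    # scores strictly below this are "bad"
    good_inclusive: bool = False  # True when the table lists an exact value


# Thresholds copied from the table; key names are hypothetical.
BANDS = {
    "factuality": Band(good=1.0, bad=0.8, good_inclusive=True),
    "consistency": Band(good=0.5, bad=0.3),
    "context_adherence": Band(good=0.8, bad=0.2),
    "latency": Band(good=0.8, bad=0.3),
}


def classify(benchmark: str, score: float) -> str:
    """Map a score to "good", "borderline", or "bad" for one benchmark."""
    band = BANDS[benchmark]
    if score < band.bad:
        return "bad"
    is_good = score >= band.good if band.good_inclusive else score > band.good
    return "good" if is_good else "borderline"
```

Under these bands, the example factuality score of 0.91 from earlier would land in the borderline range: not failing, but short of the 1.0 the table lists as good.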

Configuration-driven benchmarks

All prompts, keywords, rephrasings, and thresholds live in a single YAML file:

src/eval_guard/config.yaml

Changing a test question or adjusting a threshold requires zero code changes and no reinstall.

The database is local

The database file (eval_guard.db) is created automatically the first time you run a benchmark. It is not included in this repository.
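
For readers curious what a local score store of this kind involves, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and function names are assumptions for illustration; the actual schema eval-guard creates may differ.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and ensure the results table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS results (
            run_at    TEXT NOT NULL,  -- ISO 8601 timestamp
            model     TEXT NOT NULL,
            benchmark TEXT NOT NULL,
            score     REAL NOT NULL
        )
        """
    )
    return conn


def record(conn: sqlite3.Connection, run_at: str, model: str,
           benchmark: str, score: float) -> None:
    """Append one benchmark result."""
    conn.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        (run_at, model, benchmark, score),
    )
    conn.commit()


def last_n(conn: sqlite3.Connection, benchmark: str, model: str,
           n: int = 10) -> list[dict]:
    """Return the n most recent scores for a benchmark and model, newest first."""
    rows = conn.execute(
        "SELECT run_at, score FROM results "
        "WHERE benchmark = ? AND model = ? "
        "ORDER BY run_at DESC LIMIT ?",
        (benchmark, model, n),
    ).fetchall()
    return [{"run_at": run_at, "score": score} for run_at, score in rows]
```

Because CREATE TABLE IF NOT EXISTS is idempotent, opening the database on every run is safe, which matches the create-on-first-use behavior described above.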

Development

pip install -e ".[dev]"
pytest tests/ -v
python smoke_test.py
ruff check src/ tests/

Requirements

Python 3.12+
Ollama installed and running (or any OpenAI-compatible endpoint)
At least one model pulled locally (e.g. ollama pull llama3.1:8b)
