How do you know your AI isn't getting worse at reading your documents?
eval-guard is a lightweight Python library that runs your LLM through four regression tests and stores the scores in a local SQLite database. Drop it into an agent loop, a research pipeline, or a CI job - the database builds a time-series picture of model quality so you can spot drift before it reaches production.
You can use it as a library inside your own code, or run it from the command line when you want a quick snapshot.
- Factuality - Checks whether the model gives correct factual answers to straightforward questions about geography, science, and math.
- Consistency - Measures whether the model gives similar answers when the same question is asked in several different ways.
- Context Adherence - Tests whether the model admits it does not know the answer when the information is not present in the provided text, rather than making something up.
- Latency - Measures how fast the model generates tokens and how much GPU memory it uses. This is an infrastructure health metric, not a quality metric.
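
To make the Consistency idea concrete, here is a minimal sketch of one way to score agreement between answers to rephrased questions. It is not eval-guard's own scoring method, which this README does not describe. The function names and the choice of Jaccard token overlap are assumptions made for illustration.

```python
from itertools import combinations


def token_set(text: str) -> set[str]:
    """Lowercase the text and split it into a set of punctuation-stripped tokens."""
    tokens = {tok.strip(".,;:!?\"'()").lower() for tok in text.split()}
    return tokens - {""}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consistency_score(answers: list[str]) -> float:
    """Average pairwise token overlap across answers to rephrased questions.

    Hypothetical helper: illustrates the idea only, not eval-guard's API.
    """
    sets = [token_set(answer) for answer in answers]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0  # zero or one answer: nothing to disagree with
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A purely lexical measure like this penalizes answers that are correct but worded differently, so a real implementation may prefer keyword matching or embeddings.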
from eval_guard.runner import run_all
from eval_guard.ollama_client import OllamaClient
from eval_guard.results_db import init_db, get_trend
client = OllamaClient(base_url="http://localhost:11434")
# Run every benchmark, write results, and get a structured report
report = run_all(client, model="llama3.1:8b", db_path="eval_guard.db")
print(report["results"]["factuality"]["score"]) # 0.91
# Query the last 10 runs to see if quality is drifting
conn = init_db("eval_guard.db")
trend = get_trend(conn, benchmark="factuality", model="llama3.1:8b", n=10)
# trend == [{"run_at": "2026-06-07T14:00:00", "score": 0.91}, ...]

You can also fail programmatically when a score drops below a threshold:
report = run_all(client, "llama3.1:8b", db_path="eval_guard.db", fail_below=True)
if "failures" in report:
    send_alert(report["failures"])  # your alerting hook

Install eval-guard and its dependencies:
pip install -e .
Run all four benchmarks:
python run_eval.py --model llama3.1:8b --benchmark all
Fail the process if any score is below the acceptable threshold:
python run_eval.py --model llama3.1:8b --benchmark all --fail-below
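
If you would rather drive the CLI from a Python CI script, a thin wrapper might look like the sketch below. It uses only the flags shown above. It assumes that `--fail-below` makes the process exit with a nonzero status when a score is too low, which this README implies but does not state. The helper names are hypothetical.

```python
import subprocess
import sys


def build_eval_command(model: str, fail_below: bool = True) -> list[str]:
    """Build the run_eval.py command line using the flags documented above."""
    cmd = [sys.executable, "run_eval.py", "--model", model, "--benchmark", "all"]
    if fail_below:
        cmd.append("--fail-below")
    return cmd


def run_gate(model: str) -> bool:
    """Return True when the evaluation process exits with status 0.

    Assumption: a nonzero exit status means at least one score fell
    below its threshold (or the run itself failed).
    """
    completed = subprocess.run(build_eval_command(model), check=False)
    return completed.returncode == 0
```

A CI step could then call `run_gate("llama3.1:8b")` from the repository root and stop the pipeline when it returns `False`.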
Launch the Streamlit dashboard to see results over time, filter by model and date, refresh with one click, or download the data as CSV:
streamlit run run_dashboard.py
| Benchmark | Good score | Bad score | What a bad score means for your business |
|---|---|---|---|
| Factuality | 1.0 | Below 0.8 | Your model is giving wrong answers. Clients lose trust. |
| Consistency | Above 0.5 | Below 0.3 | Your model gives different answers to the same question depending on how you ask. Confusing and unreliable. |
| Context Adherence | Above 0.8 | Below 0.2 | Your model invents facts that were never in your documents. This is how hallucinations become client-facing. |
| Latency | Above 0.8 | Below 0.3 | Your model is responding slowly. Clients are waiting. |
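
The bands in the table can be applied mechanically. The sketch below copies the thresholds from the table. The dictionary keys other than `factuality`, the `classify` helper, and the `"borderline"` label for scores between the two bands are assumptions, since the table does not name the middle range.

```python
# (good_threshold, bad_threshold) per benchmark, copied from the table above.
# Key names other than "factuality" are assumed for illustration.
BANDS = {
    "factuality": (1.0, 0.8),
    "consistency": (0.5, 0.3),
    "context_adherence": (0.8, 0.2),
    "latency": (0.8, 0.3),
}


def classify(benchmark: str, score: float) -> str:
    """Map a score to "good", "bad", or "borderline" using the table's bands."""
    good, bad = BANDS[benchmark]
    if score < bad:
        return "bad"
    if benchmark == "factuality":
        # The table lists exactly 1.0 as the good score for factuality.
        return "good" if score >= good else "borderline"
    # The other benchmarks are good when strictly above the threshold.
    return "good" if score > good else "borderline"


print(classify("factuality", 0.91))
```

Under this reading, the factuality score of 0.91 from the library example is neither good nor bad, so it may be worth watching without raising an alert.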
All prompts, keywords, rephrasings, and thresholds live in a single YAML file:
src/eval_guard/config.yaml
Changing a test question or adjusting a threshold requires zero code changes and no reinstall.
The database file (eval_guard.db) is created automatically the first time you run a benchmark. It is not included in this repository.
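
Once the database holds a few runs, a simple drift check can be written against the list-of-dicts shape that `get_trend` returns in the library example. This is a sketch only. The `detect_drift` helper and the `tolerance` default are hypothetical, and because the README does not say in which order `get_trend` returns rows, the rows are sorted by `run_at` first.

```python
from statistics import mean


def detect_drift(trend: list[dict], tolerance: float = 0.05) -> bool:
    """Return True when the newest score falls below the mean of earlier scores.

    `trend` is assumed to look like the get_trend output shown earlier:
    [{"run_at": "<ISO timestamp>", "score": <float>}, ...]
    """
    if len(trend) < 2:
        return False  # not enough history to compare against
    # ISO 8601 timestamps in the same format sort chronologically as strings.
    ordered = sorted(trend, key=lambda row: row["run_at"])
    *history, latest = ordered
    baseline = mean(row["score"] for row in history)
    return latest["score"] < baseline - tolerance
```

Comparing against the mean of earlier runs keeps one noisy historical score from triggering an alert. A stricter check could compare against the best earlier run instead.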
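To work on eval-guard itself, install the development extras, then run the test suite, the smoke test, and the linter:
pip install -e ".[dev]"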
pytest tests/ -v
python smoke_test.py
ruff check src/ tests/
- Python 3.12+
- Ollama installed and running (or any OpenAI-compatible endpoint)
- At least one model pulled locally (e.g. `ollama pull llama3.1:8b`)