WBChain3/eval_guard

eval-guard

How do you know your AI isn't getting worse at reading your documents?

What this is

eval-guard is a lightweight Python library that runs your LLM through four regression benchmarks and stores the scores in a local SQLite database. Drop it into an agent loop, a research pipeline, or a CI job. Each run adds a row to the database, which builds a time series of model quality so you can spot drift before it reaches production.

You can use it as a library inside your own code, or run it from the command line when you want a quick snapshot.

What it measures

  • Factuality - Checks whether the model gives correct factual answers to straightforward questions about geography, science, and math.
  • Consistency - Measures whether the model gives similar answers when the same question is asked in several different ways.
  • Context Adherence - Tests whether the model admits it does not know the answer when the information is not present in the provided text, rather than making something up.
  • Latency - Measures how fast the model generates tokens and how much GPU memory it uses. This is an infrastructure health metric, not a quality metric.
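The project description says the benchmarks rely on semantic embeddings. As a rough illustration of how a consistency score could be derived from them, the sketch below averages the pairwise cosine similarity between the embeddings of the answers to several rephrasings of one question. The function names and the toy vectors are hypothetical, and this is an assumption about the general technique, not eval-guard's actual scoring code.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_a * norm_b)


def consistency_score(answer_embeddings):
    """Mean pairwise cosine similarity across all answers (hypothetical metric)."""
    pairs = list(combinations(answer_embeddings, 2))
    if not pairs:
        return 1.0  # a single answer is trivially consistent with itself
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings: two answers agree, the third points in an unrelated direction.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(round(consistency_score(toy), 3))  # 0.333
```

In practice the vectors would come from an embedding model rather than being written by hand.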

Drop it into your agent loop

from eval_guard.runner import run_all
from eval_guard.ollama_client import OllamaClient
from eval_guard.results_db import init_db, get_trend

client = OllamaClient(base_url="http://localhost:11434")

# Run every benchmark, write results, and get a structured report
report = run_all(client, model="llama3.1:8b", db_path="eval_guard.db")
print(report["results"]["factuality"]["score"])  # e.g. 0.91

# Query the last 10 runs to see if quality is drifting
conn = init_db("eval_guard.db")
trend = get_trend(conn, benchmark="factuality", model="llama3.1:8b", n=10)
# trend == [{"run_at": "2026-06-07T14:00:00", "score": 0.91}, ...]
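A trend list in the shape shown above can be turned into a simple drift check. The sketch below compares the most recent score with the mean of the earlier ones. The `has_drifted` helper and the `tolerance` value are hypothetical and not part of eval-guard, and the rows are sorted by `run_at` because the ordering of the `get_trend` result is not documented here.

```python
from statistics import mean


def has_drifted(trend, tolerance=0.05):
    """Return True if the latest score is more than `tolerance` below the earlier mean.

    `trend` is a list of {"run_at": ISO-8601 string, "score": float} dicts.
    """
    if len(trend) < 2:
        return False  # not enough history to compare against
    # ISO-8601 timestamps in the same format sort chronologically as strings.
    ordered = sorted(trend, key=lambda row: row["run_at"])
    baseline = mean(row["score"] for row in ordered[:-1])
    return ordered[-1]["score"] < baseline - tolerance


# Hypothetical sample data in the same shape as the trend above.
sample = [
    {"run_at": "2026-06-05T14:00:00", "score": 0.91},
    {"run_at": "2026-06-06T14:00:00", "score": 0.90},
    {"run_at": "2026-06-07T14:00:00", "score": 0.80},
]
print(has_drifted(sample))  # True
```

A mean-based baseline is only one option. A median or a fixed reference run may suit noisy benchmarks better.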

You can also fail programmatically when a score drops below a threshold:

report = run_all(client, model="llama3.1:8b", db_path="eval_guard.db", fail_below=True)
if "failures" in report:
    send_alert(report["failures"])  # your alerting hook

Use it from the command line

Install eval-guard and its dependencies:

pip install -e .

Run all four benchmarks:

python run_eval.py --model llama3.1:8b --benchmark all

Fail the process if any score is below the acceptable threshold:

python run_eval.py --model llama3.1:8b --benchmark all --fail-below
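If the gate needs to be driven from a Python script (for example, a deployment step), the command can be wrapped with `subprocess`. This sketch assumes that `--fail-below` makes `run_eval.py` exit with a non-zero status when a score is too low. The README only says the process fails, so check the exit behavior before relying on it. `run_gate` and `EVAL_COMMAND` are hypothetical names.

```python
import subprocess
import sys

# The same command as above, run with the current interpreter.
EVAL_COMMAND = [
    sys.executable, "run_eval.py",
    "--model", "llama3.1:8b",
    "--benchmark", "all",
    "--fail-below",
]


def run_gate(command):
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


# Usage (assumes run_eval.py is in the current working directory):
#     if not run_gate(EVAL_COMMAND):
#         sys.exit("eval-guard gate failed: at least one score is below threshold")
```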

Launch the Streamlit dashboard to see results over time, filter by model and date, refresh with one click, or download the data as CSV:

streamlit run run_dashboard.py

What the scores mean

| Benchmark | Good score | Bad score | What a bad score means for your business |
| --- | --- | --- | --- |
| Factuality | 1.0 | Below 0.8 | Your model is giving wrong answers. Clients lose trust. |
| Consistency | Above 0.5 | Below 0.3 | Your model gives different answers to the same question depending on how you ask. Confusing and unreliable. |
| Context Adherence | Above 0.8 | Below 0.2 | Your model invents facts that were never in your documents. This is how hallucinations become client-facing. |
| Latency | Above 0.8 | Below 0.3 | Your model is responding slowly. Clients are waiting. |
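The bands in the table above can be applied to a score in a few lines. This is a minimal sketch: the `classify` helper, the "borderline" label for scores between the two bands, and every benchmark key other than `factuality` are assumptions for illustration, and the real thresholds live in the configuration file.

```python
# (good_threshold, good_is_inclusive, bad_threshold), copied from the table.
# Factuality counts as good only at exactly 1.0; the others must exceed their threshold.
BANDS = {
    "factuality": (1.0, True, 0.8),
    "consistency": (0.5, False, 0.3),
    "context_adherence": (0.8, False, 0.2),  # key name is an assumption
    "latency": (0.8, False, 0.3),
}


def classify(benchmark, score):
    """Map a score to "good", "bad", or "borderline" using the table's bands."""
    good, inclusive, bad = BANDS[benchmark]
    if score < bad:
        return "bad"
    if score > good or (inclusive and score == good):
        return "good"
    return "borderline"  # between the bad and good bands


print(classify("factuality", 0.91))   # borderline
print(classify("consistency", 0.25))  # bad
```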

Configuration-driven benchmarks

All prompts, keywords, rephrasings, and thresholds live in a single YAML file:

src/eval_guard/config.yaml

Changing a test question or adjusting a threshold requires zero code changes and no reinstall.

The database is local

The database file (eval_guard.db) is created automatically the first time you run a benchmark. It is not included in this repository.
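Because the results are stored in an ordinary SQLite file, they can be inspected with Python's standard `sqlite3` module. The table layout is not documented here, so this sketch makes no assumption about it and only lists the tables that exist. `list_tables` is a hypothetical helper.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist yet.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Usage, after at least one benchmark run has created the file:
#     print(list_tables("eval_guard.db"))
```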

Development

pip install -e ".[dev]"
pytest tests/ -v
python smoke_test.py
ruff check src/ tests/

Requirements

  • Python 3.12+
  • Ollama installed and running (or any OpenAI-compatible endpoint)
  • At least one model pulled locally (e.g. ollama pull llama3.1:8b)

About

Lightweight Python library for detecting LLM quality drift. Benchmarks factuality, consistency, context adherence, and latency via semantic embeddings, with SQLite trend storage, programmatic queries, and a Streamlit dashboard.
