How do you know if your AI is actually good?
Think about it. You ask GPT-4 a question, it gives you an answer. Sounds reasonable. But is it correct? Is it faithful to the sources? Is it even relevant to what you asked?
Most people just... read the answer and think "yeah, that looks right." But here's the thing: that doesn't scale. When you have thousands of queries, you can't manually check each one. You need a system.
That's what this framework does.
This project was built as part of a production-grade ML engineering assignment. The goal: create a comprehensive LLM evaluation framework that goes beyond simple metrics.
The core problem this addresses:
Models can sound confident while being completely wrong.
It's called hallucination. The model generates fluent, convincing text that has nothing to do with reality. And basic metrics like BLEU and ROUGE don't catch it; they just count word overlap.
This framework addresses that by measuring multiple dimensions of quality.
Evaluating LLMs isn't a single-number problem. It's a multi-dimensional problem.
Think about what can go wrong with a model's answer:
- It could be grammatically perfect but factually wrong
- It could be accurate but irrelevant to the question
- It could sound good but be made up (not grounded in sources)
- It could answer a different question than what was asked
No single metric catches all of these. So this framework uses six different lenses:
```
┌───────────────────────────────────────────────────────────┐
│ REFERENCE-BASED                                           │
│   ┌──────────┐    ┌──────────┐    ┌────────────┐          │
│   │   BLEU   │    │ ROUGE-L  │    │ BERTScore  │          │
│   │  n-gram  │    │   LCS    │    │  semantic  │          │
│   └──────────┘    └──────────┘    └────────────┘          │
├───────────────────────────────────────────────────────────┤
│ RAG-SPECIFIC                                              │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐  │
│   │ Faithfulness │   │   Context    │   │    Answer    │  │
│   │  grounded?   │   │  Relevancy   │   │  Relevancy   │  │
│   └──────────────┘   └──────────────┘   └──────────────┘  │
└───────────────────────────────────────────────────────────┘
```
Each metric was chosen because it catches something the others miss:
| Metric | Why I included it | What it catches that others don't |
|---|---|---|
| BLEU | Industry standard, comparable to other papers | Exact word matching, useful baseline |
| ROUGE-L | Better for variable-length responses | Sequence matching regardless of word position |
| BERTScore | Finally understands meaning | "car" and "automobile" are similar |
| Faithfulness | The hallucination catcher | Facts made up vs facts from sources |
| Context Relevancy | Catches retrieval failures | Did we even fetch the right documents? |
| Answer Relevancy | The "did you answer MY question" check | Model answered correctly... but wrong question |
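To make that last column concrete, here's a tiny standalone illustration (not part of the framework; it assumes the `nltk` and `bert-score` packages are installed) of how word overlap punishes a harmless paraphrase that a semantic metric accepts:

```python
# Standalone illustration (not framework code): BLEU penalizes a paraphrase
# that BERTScore recognizes as semantically equivalent.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

reference = "the car is parked outside"
prediction = "the automobile is parked outside"

# BLEU counts token overlap, so "car" vs "automobile" is simply a miss.
bleu = sentence_bleu(
    [reference.split()],
    prediction.split(),
    smoothing_function=SmoothingFunction().method1,
)

# BERTScore compares contextual embeddings, so the paraphrase scores high.
_, _, f1 = bert_score([prediction], [reference], lang="en")

print(f"BLEU: {bleu:.2f}   BERTScore F1: {f1.item():.2f}")
```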
The trade-off with semantic similarity metrics: they're slower (need to load neural models) but much more accurate. The accuracy is worth the speed hit, especially for evaluation, where you're not running in real time.
The framework is designed so anyone can add their own metrics easily using a factory pattern:
```python
MetricFactory.register("my_metric", MyMetricClass)
```

One line to register, and your metric works everywhere: CLI, pipeline, reports. This took extra effort upfront but makes the system actually extensible.
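Under the hood, the factory is just a registry mapping names to classes. A minimal sketch of the idea (simplified; only `MetricFactory.register` is taken from the actual interface above, the rest is illustrative):

```python
# Minimal sketch of a metric registry/factory (simplified illustration,
# not the framework's exact implementation).
class MetricFactory:
    _registry: dict[str, type] = {}

    @classmethod
    def register(cls, name: str, metric_cls: type) -> None:
        # Map the string name used in config files to a metric class.
        cls._registry[name] = metric_cls

    @classmethod
    def create(cls, name: str, **kwargs):
        # Instantiate a metric by name; unknown names fail loudly.
        if name not in cls._registry:
            raise KeyError(f"Unknown metric: {name!r}")
        return cls._registry[name](**kwargs)
```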
Why Pydantic instead of plain dictionaries?
- It validates config files before running (fail fast)
- Error messages tell you exactly what's wrong
- Type hints work in IDEs (autocomplete, documentation)
This catches configuration problems immediately rather than failing deep in the evaluation.
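For illustration, a config model along these lines (a simplified sketch with illustrative field names; the real models live in `config.py`) rejects a bad config before a single API call is made:

```python
# Simplified sketch of fail-fast config validation with Pydantic
# (illustrative field names; the real models live in src/llm_eval/config.py).
from pydantic import BaseModel, Field, ValidationError

class JudgeConfig(BaseModel):
    provider: str = "groq"
    model: str
    temperature: float = Field(0.0, ge=0.0, le=2.0)

class EvalConfig(BaseModel):
    dataset_path: str
    output_dir: str = "results"
    judge: JudgeConfig

try:
    EvalConfig(
        dataset_path="benchmarks/rag_benchmark.jsonl",
        judge={"model": "llama-3.3-70b-versatile", "temperature": 9},
    )
except ValidationError as err:
    # Pydantic pinpoints the offending field instead of failing mid-run.
    print(err)
```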
To avoid vendor lock-in, the framework supports:
- OpenAI (GPT-4): highest quality
- Anthropic (Claude): good alternative
- Groq (Llama): free tier available!
The same prompt and rubric work across providers, so results are comparable.
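One way to keep providers interchangeable is to hide them behind a single judge interface, roughly like this (an illustrative sketch; the class and method names are hypothetical, not the framework's exact API):

```python
# Hedged sketch of a provider-agnostic judge interface
# (hypothetical names; real implementations live under judges/).
from abc import ABC, abstractmethod

class JudgeClient(ABC):
    @abstractmethod
    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        """Send the judging prompt and return the raw model response."""

class GroqJudge(JudgeClient):
    def __init__(self, model: str = "llama-3.3-70b-versatile"):
        self.model = model

    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        # The actual Groq chat-completions call would go here; the prompt
        # and rubric are identical to what the other providers receive.
        raise NotImplementedError
```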
API calls fail. Rate limits happen. Instead of crashing, the system retries with exponential backoff:
```
Attempt 1 fails → wait 1s → retry
Attempt 2 fails → wait 2s → retry
Attempt 3 fails → wait 4s → retry
```
This sounds simple but makes the difference between a toy project and something you can actually rely on.
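A minimal sketch of the idea (simplified; the framework's retry helper lives in `utils/` and, as noted later, also adds random jitter):

```python
# Simplified retry-with-exponential-backoff sketch (illustrative; the
# framework's helper in utils/ also adds jitter to avoid retry stampedes).
import random
import time

def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the original error
            # 1s, 2s, 4s, ... plus a bit of jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            time.sleep(delay)
```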
Comparison with typical approaches:
| Existing approach | The problem | What I did differently |
|---|---|---|
| Separate scripts for each metric | Hard to compare, inconsistent | Unified pipeline, same interface |
| Config in code | Have to edit Python to change settings | YAML/JSON config files |
| One report format | Either machine OR human readable | Both JSON and Markdown |
| No visualizations | Hard to spot patterns | Histograms + radar charts |
| Single provider lock-in | Expensive or risky | Three LLM providers supported |
The goal was: run one command, get everything you need to evaluate your model.
```
┌──────────────┐
│  Your Data   │
│   (JSONL)    │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│    Config    │ ◄──── YAML file with settings
│  (Pydantic)  │
└──────┬───────┘
       │
       ▼
┌─────────────────────────┐
│  Evaluation Pipeline    │
│                         │
│  1. Load benchmark      │
│  2. Load model output   │
│  3. Match examples      │
│  4. Compute metrics     │
│  5. Run LLM judge       │
│  6. Generate reports    │
└────────────┬────────────┘
             │
      ┌──────┴───────┐
      ▼              ▼
┌───────────┐  ┌────────────┐
│   JSON    │  │  Markdown  │
│  Report   │  │   Report   │
└───────────┘  └────────────┘
```
Why this structure?
- Separation of concerns: CLI doesn't know about metrics, metrics don't know about reports
- Testability: I can test each layer independently with mocks
- Extensibility: Add a new metric? Just implement the interface. Add a new report format? Same thing.
This is the kind of architecture you see in production systems - not because it's fancy, but because it actually makes the code maintainable.
The benchmark has query-answer pairs. The model output has query-prediction pairs. But what if the queries don't match exactly? Whitespace differences, slight rewording...
Solution: Normalize queries before matching, with fallback to fuzzy matching. The loader handles this transparently.
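For illustration, the matching logic can be as small as this (a standard-library sketch; the real loader in `dataset.py` is the source of truth):

```python
# Sketch of query matching: normalize first, fall back to fuzzy matching.
# (Illustrative only; the framework's loader in dataset.py handles this.)
import difflib

def normalize(query: str) -> str:
    # Collapse whitespace and lowercase so trivial differences don't matter.
    return " ".join(query.lower().split())

def match_query(query: str, candidates: list[str], cutoff: float = 0.9):
    normalized = {normalize(c): c for c in candidates}
    key = normalize(query)
    if key in normalized:            # exact match after normalization
        return normalized[key]
    close = difflib.get_close_matches(key, normalized.keys(), n=1, cutoff=cutoff)
    return normalized[close[0]] if close else None   # fuzzy fallback
```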
Loading the sentence-transformer model takes time. Running embeddings on every example is expensive.
Solution: Lazy loading (model loads only when needed) + class-level caching (model loaded once, reused). Also batch processing instead of one-at-a-time.
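A condensed sketch of the pattern (assumes the `sentence-transformers` package; the model name is illustrative):

```python
# Sketch of lazy loading + class-level caching for the embedding model
# (assumes the sentence-transformers package; model name is illustrative).
from sentence_transformers import SentenceTransformer

class EmbeddingScorer:
    _model = None  # shared across instances, loaded at most once

    @classmethod
    def _get_model(cls):
        if cls._model is None:                      # lazy: load on first use
            cls._model = SentenceTransformer("all-MiniLM-L6-v2")
        return cls._model

    def embed(self, texts: list[str]):
        # Batch encoding is far cheaper than one call per example.
        return self._get_model().encode(texts, batch_size=32)
```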
Sometimes the judge doesn't return valid JSON. Sometimes it adds extra text around the JSON.
Solution: Robust parsing with regex fallbacks. Extract JSON from anywhere in the response. If parsing fails completely, log the error and continue; don't crash the whole evaluation.
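A minimal sketch of that parsing strategy (standard library only; the real judge code adds logging around the failure path):

```python
# Sketch of tolerant JSON extraction from an LLM judge response.
# (Standard library only; the framework's judges log failures and move on.)
import json
import re

def extract_json(response: str):
    # 1. Happy path: the whole response is valid JSON.
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        pass
    # 2. Fallback: grab the first {...} block embedded in surrounding text.
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # 3. Give up gracefully: the caller logs the error and keeps evaluating.
    return None
```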
When running many evaluations, you hit rate limits.
Solution: Exponential backoff with jitter. The system automatically retries with increasing delays. This handles transient failures gracefully.
- Metrics are harder than they look. Getting BLEU to handle edge cases (empty strings, single words, very long texts) took way more code than the core algorithm.
- Configuration is a feature. Good config validation saves hours of debugging. Invest in it early.
- Multiple evaluation dimensions are essential. A model can score high on BLEU and still be useless. You need to measure what actually matters.
- Retry logic is not optional. Any system that calls external APIs needs resilient error handling.
- Testing with mocks is the only way to test API integrations. You can't run real API calls in CI. Mock everything external.
To get started:

```bash
git clone https://github.com/your-org/llm-eval.git
cd llm-eval
poetry install
cp .env.example .env
```

Add at least one key (Groq is free!):

```
GROQ_API_KEY=gsk_...
```

Then run an evaluation:

```bash
llm-eval run --config examples/config.yaml --output-dir results/
```

| Command | Purpose |
|---|---|
| `llm-eval run --config config.yaml` | Run full evaluation |
| `llm-eval validate --config config.yaml` | Check config validity |
| `llm-eval list-metrics` | Show available metrics |
Useful flags:
- `--metrics bleu rouge_l`: run only specific metrics
- `--verbose`: debug logging
- `--no-progress`: CI-friendly output
A typical config file:

```yaml
dataset_path: benchmarks/rag_benchmark.jsonl
output_dir: results

models:
  - name: gpt-4
    output_path: outputs/gpt4.jsonl

metrics:
  bleu: true
  rouge_l: true
  bertscore: true
  faithfulness: true
  context_relevancy: true
  answer_relevancy: true
  llm_judge: true

judge:
  provider: groq
  model: llama-3.3-70b-versatile
  temperature: 0.0
```

To add your own metric, implement the `Metric` interface and register it:

```python
from llm_eval.metrics.base import Metric, MetricResult

class MyMetric(Metric):
    name = "my_metric"

    def compute(self, prediction, reference, **kwargs):
        score = your_calculation(prediction, reference)
        return MetricResult(score=score)

# Register
from llm_eval.metrics import MetricFactory
MetricFactory.register("my_metric", MyMetric)
```

Docker is supported too:

```bash
docker-compose up                  # Run evaluation
docker-compose --profile test up   # Run tests
```

Multi-stage build, non-root user, health checks included.
Run the tests with:

```bash
pytest tests/ -v                 # All tests
pytest tests/ --cov=llm_eval     # With coverage
pytest tests/unit/ -v            # Unit only
```

Project layout:

```
src/llm_eval/
├── cli.py          # Typer commands
├── config.py       # Pydantic models
├── dataset.py      # Data loading
├── pipeline.py     # Orchestration
├── metrics/        # BLEU, ROUGE, BERTScore, RAG metrics
├── judges/         # OpenAI, Anthropic, Groq
├── reporting/      # JSON, Markdown, Charts
└── utils/          # Logging, retry
```
Here's what I believe after building this:
Evaluation isn't just a checkbox. It's how you know if your work actually works.
When you deploy an AI system, you're making a promise that it will help users. This framework helps you keep that promise by measuring what matters, catching what fails, and giving you clear answers instead of vague feelings.
That's what good engineering looks like.
- Fork → Branch → Code → Test → PR

MIT: use it however you want.
Built with care, tested with rigor, documented with clarity.