This document provides a detailed overview of the system architecture, design decisions, and component interactions.
The LLM Evaluation Framework is designed as a modular, extensible system for evaluating Large Language Model outputs. It follows clean architecture principles with clear separation of concerns.
```
┌──────────────────────────────────────────────────────────────────────────────┐
│                                  CLI Layer                                   │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ Typer CLI (cli.py)                                                       │ │
│ │ - Command parsing    - Progress display    - Exit codes                  │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────────────────────────────┤
│                             Orchestration Layer                              │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ Pipeline (pipeline.py)                                                   │ │
│ │ - Coordinates evaluation    - Manages metrics    - Generates reports     │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────────────────────────────┤
│                                  Core Layer                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │    Config    │  │   Dataset    │  │   Metrics    │  │      Judges      │  │
│  │  (Pydantic)  │  │    Loader    │  │  (6 types)   │  │  (3 providers)   │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────────┘  │
├──────────────────────────────────────────────────────────────────────────────┤
│                                 Output Layer                                 │
│  ┌──────────────────────┐  ┌──────────────────────┐  ┌────────────────────┐  │
│  │    JSON Reporter     │  │  Markdown Reporter   │  │     Visualizer     │  │
│  │  (machine-readable)  │  │   (human-readable)   │  │    (PNG charts)    │  │
│  └──────────────────────┘  └──────────────────────┘  └────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────────┘
```
```
src/llm_eval/
├── __init__.py              # Package initialization
├── cli.py                   # Typer CLI commands
├── config.py                # Pydantic configuration models
├── dataset.py               # Dataset loading and validation
├── pipeline.py              # Evaluation orchestration
│
├── metrics/                 # Metric implementations
│   ├── __init__.py          # Exports and factory registration
│   ├── base.py              # Abstract Metric class
│   ├── bleu.py              # BLEU score
│   ├── rouge.py             # ROUGE-L score
│   ├── bertscore.py         # BERTScore
│   ├── faithfulness.py      # Faithfulness metric
│   ├── context_relevancy.py # Context relevancy
│   └── answer_relevancy.py  # Answer relevancy
│
├── judges/                  # LLM-as-a-Judge implementations
│   ├── __init__.py          # Exports and factory
│   ├── base.py              # Abstract Judge class
│   ├── openai_judge.py      # GPT-4 integration
│   ├── anthropic_judge.py   # Claude integration
│   └── groq_judge.py        # Groq/Llama integration
│
├── reporting/               # Report generators
│   ├── __init__.py          # Exports
│   ├── json_reporter.py     # JSON output
│   ├── markdown_reporter.py # Markdown output
│   └── visualizer.py        # Chart generation
│
└── utils/                   # Utilities
    ├── __init__.py          # Exports
    ├── logging.py           # Logging configuration
    └── retry.py             # Retry with backoff
```
The configuration system uses Pydantic for robust validation:
```python
class EvaluationConfig(BaseModel):
    dataset_path: Path          # Benchmark dataset
    output_dir: Path            # Results directory
    models: List[ModelConfig]   # Models to evaluate
    metrics: MetricsConfig      # Which metrics to compute
    judge: JudgeConfig          # LLM judge settings
```

Key Features:

- YAML and JSON support
- Environment variable integration for API keys
- Early validation with clear error messages
- Type hints for all fields
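As an illustration of the early-validation behavior, here is a reduced sketch (the field set is trimmed, and `ModelConfig` is simplified to a name/provider pair; the real models have more fields):

```python
from pathlib import Path
from typing import List

from pydantic import BaseModel, ValidationError


class ModelConfig(BaseModel):
    name: str
    provider: str


class EvaluationConfig(BaseModel):
    dataset_path: Path
    output_dir: Path
    models: List[ModelConfig]


# Valid input is coerced into typed fields (str -> Path, dict -> ModelConfig).
config = EvaluationConfig(
    dataset_path="data/benchmark.jsonl",
    output_dir="results/",
    models=[{"name": "gpt-4", "provider": "openai"}],
)
assert isinstance(config.dataset_path, Path)

# Invalid input fails immediately, with the error pointing at the bad field.
try:
    EvaluationConfig(dataset_path="d.jsonl", output_dir="out", models="oops")
    raise AssertionError("should have failed validation")
except ValidationError as exc:
    bad_field = exc.errors()[0]["loc"][0]
assert bad_field == "models"
```

Because validation happens at construction time, a malformed config file is rejected before any API calls are made.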
Handles loading and validation of benchmark data:
```python
class DatasetLoader:
    def load_benchmark(self) -> List[BenchmarkExample]: ...
    def load_model_outputs(self) -> List[ModelOutput]: ...
    def iter_benchmark(self) -> Iterator[BenchmarkExample]: ...
```

Formats Supported:

- JSONL (JSON Lines)
- CSV with proper escaping

Required Fields:

- `query`: The input question
- `expected_answer`: Reference answer
- `retrieved_contexts`: List of context passages
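For illustration, one benchmark record in JSONL form and a minimal loader for it (field names follow the list above; the real `DatasetLoader` also handles CSV and validation errors):

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class BenchmarkExample:
    query: str
    expected_answer: str
    retrieved_contexts: List[str]


# One JSON object per line; the required fields map directly onto the dataclass.
line = json.dumps({
    "query": "What is the capital of France?",
    "expected_answer": "Paris",
    "retrieved_contexts": ["Paris is the capital and largest city of France."],
})

example = BenchmarkExample(**json.loads(line))
assert example.expected_answer == "Paris"
```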
All metrics inherit from a common base class:
```python
class Metric(ABC):
    name: str
    description: str

    @abstractmethod
    def compute(
        self,
        prediction: str,
        reference: str,
        query: Optional[str] = None,
        contexts: Optional[List[str]] = None,
        **kwargs
    ) -> MetricResult:
        ...
```

Metric Types:

| Category | Metrics | Implementation |
|---|---|---|
| Reference-based | BLEU, ROUGE-L, BERTScore | NLTK, rouge-score, sentence-transformers |
| RAG-specific | Faithfulness, Context Relevancy, Answer Relevancy | Embedding similarity |
| LLM-based | Coherence, Relevance, Safety | API calls to LLMs |
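The embedding-similarity family in the table reduces to cosine similarity between vectors. Below is a toy sketch with hand-written 3-d vectors standing in for sentence-transformer embeddings; the actual scoring formulas for faithfulness and the relevancy metrics are assumptions here and are more involved in practice:

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy(query_emb: List[float], answer_emb: List[float]) -> float:
    # Assumed scoring rule: cosine similarity between the query embedding
    # and the answer embedding, clipped to [0, 1].
    return max(0.0, cosine_similarity(query_emb, answer_emb))


# Identical direction -> maximally relevant; orthogonal -> irrelevant.
assert answer_relevancy([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]) == 1.0
assert answer_relevancy([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]) == 0.0
```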
LLM-as-a-Judge implementations for nuanced evaluation:
```python
class Judge(ABC):
    def evaluate(
        self,
        query: str,
        answer: str,
        contexts: Optional[List[str]] = None,
        reference: Optional[str] = None
    ) -> JudgeResult:
        ...
```

Supported Providers:

- OpenAI (GPT-4)
- Anthropic (Claude)
- Groq (Llama 3.1, Mixtral)
Features:
- Structured JSON output
- Multi-dimensional rubric
- Retry with exponential backoff
- Response parsing with fallbacks
Orchestrates the entire evaluation process:
```python
class EvaluationPipeline:
    def __init__(self, config: EvaluationConfig): ...
    def run(self) -> Dict[str, Dict[str, Any]]: ...
    def evaluate_model(self, model_name, output_path, benchmark): ...
```

Responsibilities:

- Load benchmark dataset
- Load model outputs
- Match outputs to examples
- Compute all enabled metrics
- Run LLM judge (if enabled)
- Generate reports and visualizations
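In outline, the matching-and-scoring loop might look like this (illustrative names and shapes; the real `evaluate_model` also passes query/contexts, runs the judge, and records errors):

```python
from typing import Any, Dict, List


def evaluate_model(metrics, benchmark: List[dict], outputs_by_id: Dict[Any, str]):
    """Simplified sketch of the pipeline's inner loop."""
    results: Dict[str, List[float]] = {m.name: [] for m in metrics}
    for example in benchmark:
        prediction = outputs_by_id.get(example["id"])
        if prediction is None:
            continue  # unmatched example: tracked as an error in the real pipeline
        for metric in metrics:
            results[metric.name].append(
                metric.compute(prediction, example["expected_answer"])
            )
    return results


class ExactMatch:
    """Stand-in metric for the demo."""
    name = "exact_match"

    def compute(self, prediction, reference):
        return 1.0 if prediction == reference else 0.0


benchmark = [{"id": 1, "expected_answer": "Paris"},
             {"id": 2, "expected_answer": "Berlin"}]
outputs = {1: "Paris", 2: "Bern"}
scores = evaluate_model([ExactMatch()], benchmark, outputs)
assert scores == {"exact_match": [1.0, 0.0]}
```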
Generates output in multiple formats:
JSON Reporter:
- Machine-readable format
- Aggregate statistics (mean, median, std, min, max)
- Per-example breakdowns
- Error tracking
Markdown Reporter:
- Human-readable format
- Formatted tables
- Color-coded scores
- Sample results
Visualizer:
- Histogram per metric
- Radar chart for multi-model comparison
- PNG output at high resolution
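The aggregate statistics listed above are straightforward with the standard library; a sketch (whether the real reporter uses sample or population standard deviation is an assumption here):

```python
import statistics
from typing import Dict, List


def aggregate(scores: List[float]) -> Dict[str, float]:
    """Summary statistics in the shape the JSON report describes."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


summary = aggregate([1.0, 2.0, 3.0])
assert summary == {"mean": 2.0, "median": 2.0, "std": 1.0, "min": 1.0, "max": 3.0}
```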
```
┌─────────────┐     ┌─────────────┐     ┌─────────────────┐
│   Config    │────▶│  Pipeline   │────▶│ Dataset Loader  │
│ (YAML/JSON) │     │             │     │                 │
└─────────────┘     └──────┬──────┘     └────────┬────────┘
                           │                     │
                           ▼                     ▼
                    ┌─────────────┐     ┌─────────────────┐
                    │   Metrics   │◀────│   Benchmark +   │
                    │   Factory   │     │  Model Outputs  │
                    └──────┬──────┘     └─────────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
   │    BLEU     │  │  BERTScore  │  │     LLM     │
   │   ROUGE-L   │  │ RAG Metrics │  │    Judge    │
   └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
          │                │                │
          └────────────────┼────────────────┘
                           ▼
                  ┌─────────────────┐
                  │   Aggregation   │
                  │  & Statistics   │
                  └────────┬────────┘
                           │
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
    ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
    │    JSON     │ │  Markdown   │ │   Charts    │
    │   Report    │ │   Report    │ │    (PNG)    │
    └─────────────┘ └─────────────┘ └─────────────┘
```
```python
class MetricFactory:
    _registry: Dict[str, Type[Metric]] = {}

    @classmethod
    def register(cls, name: str, metric_class: Type[Metric]):
        cls._registry[name] = metric_class

    @classmethod
    def create(cls, name: str, **kwargs) -> Metric:
        return cls._registry[name](**kwargs)
```

Each metric implements the same interface, allowing them to be used interchangeably:

```python
for metric in self._metrics:
    result = metric.compute(prediction, reference, query, contexts)
```

The Judge base class defines the evaluation template:
```python
class Judge(ABC):
    def evaluate(self, query, answer, contexts, reference):
        prompt = self._build_prompt(...)       # Template step
        response = self._call_llm(prompt)      # Abstract step
        return self._parse_response(response)  # Template step
```

Concrete judges supply `_call_llm`, wrapped in retry logic with exponential backoff:

```python
@retry_with_backoff(max_retries=3, min_wait=1.0, max_wait=30.0)
def _call_llm(self, prompt: str) -> str:
    return self.client.chat.completions.create(...)
```

To add a new metric:

- Create a new file in `src/llm_eval/metrics/`:
```python
# my_metric.py
from llm_eval.metrics.base import Metric, MetricResult

class MyMetric(Metric):
    name = "my_metric"

    def compute(self, prediction, reference, **kwargs):
        score = your_calculation(prediction, reference)
        return MetricResult(score=score)
```

- Register in `metrics/__init__.py`:
```python
from llm_eval.metrics.my_metric import MyMetric

MetricFactory.register("my_metric", MyMetric)
```

To add a new judge:

- Create a new file in `src/llm_eval/judges/`:
```python
# custom_judge.py
from llm_eval.judges.base import Judge

class CustomJudge(Judge):
    def _call_llm(self, prompt: str) -> str:
        # Implement your API call
        return response
```

- Register in `judges/__init__.py`:
```python
from llm_eval.judges.custom_judge import CustomJudge

def create_judge(provider, **kwargs):
    providers = {
        "custom": CustomJudge,
        # ...
    }
```

- Create a new reporter in `src/llm_eval/reporting/`:
```python
class HTMLReporter:
    def generate(self, results, model_name):
        # Generate HTML report
        pass
```

- Integrate into the pipeline's `_generate_reports` method.
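As a hypothetical sketch of that last step, assuming the pipeline fans results out to a list of reporter objects sharing the `generate(results, model_name)` interface (names here are illustrative, not the actual `pipeline.py` internals):

```python
class HTMLReporter:
    """Illustrative reporter with the same generate() interface as the others."""

    def generate(self, results, model_name):
        rows = "".join(
            f"<tr><td>{metric}</td><td>{score}</td></tr>"
            for metric, score in results.items()
        )
        return f"<h1>{model_name}</h1><table>{rows}</table>"


def _generate_reports(reporters, results, model_name):
    # Assumed shape of the pipeline hook: fan results out to every reporter.
    return [reporter.generate(results, model_name) for reporter in reporters]


reports = _generate_reports([HTMLReporter()], {"bleu": 0.42}, "gpt-4")
assert "<h1>gpt-4</h1>" in reports[0]
```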
BERTScore and embedding-based metrics batch encode texts:
```python
def compute_batch(self, predictions, references):
    embeddings = self._model.encode(all_texts, batch_size=32)
    # Process in batch...
```

Sentence transformer models are cached at class level:
```python
class BERTScoreMetric(Metric):
    _model: Optional[SentenceTransformer] = None

    def _ensure_model_loaded(self):
        if BERTScoreMetric._model is None:
            BERTScoreMetric._model = SentenceTransformer(self.model_name)
```

LLM judge calls use exponential backoff to handle rate limits:
```python
@retry_with_backoff(max_retries=3, min_wait=1.0, max_wait=30.0)
def _call_llm(self, prompt):
    # API call with automatic retry on rate limit
    ...
```

Security:

- API Keys: Stored in environment variables, never in code
- Input Validation: Pydantic models validate all inputs
- Container Security: Non-root user in Docker
- CI/CD: Trivy security scanning in GitHub Actions