Unified evaluate() API — One function for all evaluation needs. Local metrics, cloud templates, and LLM-as-judge through a single entrypoint.
```python
from fi.evals import evaluate

# local metric (default engine)
result = evaluate("faithfulness", output="...", context="...")

# hybrid: fast local check, LLM refinement for edge cases
result = evaluate("faithfulness", output="...", context="...",
                  model="gemini/gemini-3.0-flash", augment=True)

# custom LLM-as-judge criteria
result = evaluate(prompt="Rate empathy...", output="...",
                  engine="llm", model="gemini/gemini-3.0-flash")
```

Added
- 72 built-in metrics — faithfulness, hallucination detection (DeBERTa NLI), groundedness, context recall/precision, answer relevancy, function call accuracy, agent trajectory, code security (SQL injection, XSS, path traversal, command injection, sensitive data exposure), PII detection, prompt injection, toxicity, string matching, JSON/API schema validation, and more
- 3 evaluation engines — local (< 5 ms, zero API calls), turing (cloud-hosted models), llm (any model via LiteLLM)
- augment=True — hybrid evaluation: fast local heuristic first, LLM refines for edge cases
- Multimodal LLM judge — pass image_url or audio_url to evaluate images/audio with Gemini, GPT-4V, or Claude
- generate_prompt=True — auto-generate grading criteria from a short description
- Feedback loop — submit human corrections, calibrate pass/fail thresholds statistically, inject corrections as few-shot examples
- Streaming evaluation — score responses in real time as they are generated
- AutoEval — automatic metric recommendation based on your use case (RAG, chatbot, agent)
- OpenTelemetry integration — gen_ai.evaluation.* semantic conventions, auto-enrichment for existing traces
- 9 cookbooks — local metrics, LLM judge, RAG pipelines, guardrails, streaming, autoeval, OTEL, feedback loop, multimodal
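The augment=True flow can be illustrated with a self-contained sketch. Everything below (the lexical heuristic, the confidence band, the stubbed judge) is an illustrative stand-in for the idea, not the library's internals:

```python
def local_faithfulness(output: str, context: str) -> float:
    """Cheap lexical heuristic: fraction of output tokens found in the context."""
    out_tokens = output.lower().split()
    ctx = set(context.lower().split())
    if not out_tokens:
        return 0.0
    return sum(t in ctx for t in out_tokens) / len(out_tokens)

def evaluate_hybrid(output, context, llm_judge, low=0.3, high=0.8):
    """Fast local score first; only borderline cases escalate to the LLM."""
    score = local_faithfulness(output, context)
    if score <= low or score >= high:   # confidently pass/fail -> no API call
        return {"score": score, "engine": "local"}
    return {"score": llm_judge(output, context), "engine": "llm"}

# stub judge standing in for a real model call
judge = lambda o, c: 0.9
print(evaluate_hybrid("the sky is blue", "the sky is blue today", judge))
```

The point of the band is cost: clear-cut cases never leave the process, and only ambiguous mid-range scores pay for an LLM call.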
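Statistical threshold calibration from the feedback loop can be sketched as choosing the pass/fail cutoff that best agrees with human verdicts. This is a minimal illustration; the SDK's actual statistics may differ:

```python
def calibrate_threshold(scores, human_pass):
    """Pick the cutoff that maximizes agreement with human pass/fail labels."""
    candidates = sorted(set(scores)) + [1.1]   # cut at each observed score, plus "fail all"
    best_t, best_acc = 0.5, -1.0
    for t in candidates:
        acc = sum((s >= t) == h for s, h in zip(scores, human_pass)) / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

scores     = [0.2, 0.4, 0.55, 0.7, 0.9]
human_pass = [False, False, True, True, True]
print(calibrate_threshold(scores, human_pass))  # 0.55 separates the labels perfectly
```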
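Streaming evaluation amounts to re-scoring the accumulated text as each chunk arrives. A dependency-free sketch with a toy scorer (the hedge-word penalty is an invented stand-in for a real metric):

```python
from typing import Callable, Iterable

def stream_scores(chunks: Iterable[str], score: Callable[[str], float]):
    """Re-score the accumulated text after every chunk, yielding live scores."""
    text = ""
    for chunk in chunks:
        text += chunk
        yield text, score(text)

# stand-in scorer: penalize hedging words as they appear
hedges = {"maybe", "probably", "possibly"}
scorer = lambda t: 1.0 - min(1.0, 0.5 * sum(w in hedges for w in t.lower().split()))

for text, s in stream_scores(["Paris is ", "probably ", "the capital."], scorer):
    print(f"{s:.1f}  {text!r}")
```

Because scores arrive per chunk, a caller can abort generation early once the live score drops below a guardrail threshold.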
Changed
- Package manager migrated from Poetry to uv
- Evaluator class is now legacy — evaluate() is the recommended API
Infrastructure
- Distributed backends: Celery, Ray, Temporal, Kubernetes
- ChromaDB-backed feedback store with semantic vector search
- Docker compose files for all backends
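What the ChromaDB-backed feedback store provides — save human corrections, then retrieve the most similar ones to inject as few-shot examples — can be sketched without the dependency. The bag-of-words cosine below is a toy stand-in for real embeddings, and the class names are hypothetical:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class FeedbackStore:
    """Minimal in-memory stand-in for the vector-backed feedback store."""
    def __init__(self):
        self.items = []   # (embedding, correction)

    def add(self, text: str, correction: str):
        self.items.append((embed(text), correction))

    def nearest(self, query: str, k: int = 2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [c for _, c in ranked[:k]]

store = FeedbackStore()
store.add("the model hallucinated a citation", "fail: fabricated reference")
store.add("answer ignored the provided context", "fail: not grounded")
store.add("response was polite and accurate", "pass")
print(store.nearest("it invented a fake citation", k=1))
```

In the real store, ChromaDB's persistence and ANN index replace the linear scan, but the retrieve-similar-corrections shape is the same.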