Unified evaluate() API — One function for all evaluation needs. Local metrics, cloud templates, and LLM-as-judge through a single entrypoint.
```python
from fi.evals import evaluate

# local metric (default engine)
result = evaluate("faithfulness", output="...", context="...")

# hybrid: fast local check, LLM refinement for edge cases
result = evaluate("faithfulness", output="...", context="...",
                  model="gemini/gemini-3.0-flash", augment=True)

# custom LLM-as-judge criteria
result = evaluate(prompt="Rate empathy...", output="...",
                  engine="llm", model="gemini/gemini-3.0-flash")
```

Added
- 72 built-in metrics — faithfulness, hallucination detection (DeBERTa NLI), groundedness, context recall/precision, answer relevancy, function call accuracy, agent trajectory, code security (SQL injection, XSS, path traversal, command injection, sensitive data exposure), PII detection, prompt injection, toxicity, string matching, JSON/API schema validation, and more
- 3 evaluation engines — local (< 5 ms, zero API calls), turing (cloud-hosted models), llm (any model via LiteLLM)
- augment=True — hybrid evaluation: fast local heuristic first, LLM refines for edge cases
- Multimodal LLM judge — pass image_url or audio_url to evaluate images/audio with Gemini, GPT-4V, or Claude
- generate_prompt=True — auto-generate grading criteria from a short description
- Feedback loop — submit human corrections, calibrate pass/fail thresholds statistically, inject corrections as few-shot examples
- Streaming evaluation — score responses in real time as they are generated
- AutoEval — automatic metric recommendation based on your use case (RAG, chatbot, agent)
- OpenTelemetry integration — gen_ai.evaluation.* semantic conventions, auto-enrichment for existing traces
- 9 cookbooks — local metrics, LLM judge, RAG pipelines, guardrails, streaming, autoeval, OTEL, feedback loop, multimodal
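The augment=True flow can be illustrated with a self-contained sketch. Everything below (the lexical heuristic, the confidence band, the stubbed judge) is an illustrative stand-in for the idea, not the library's internals:

```python
def local_faithfulness(output: str, context: str) -> float:
    """Cheap lexical heuristic: fraction of output tokens found in the context."""
    out_tokens = output.lower().split()
    ctx = set(context.lower().split())
    if not out_tokens:
        return 0.0
    return sum(t in ctx for t in out_tokens) / len(out_tokens)

def evaluate_hybrid(output, context, llm_judge, low=0.3, high=0.8):
    """Fast local score first; only borderline cases escalate to the LLM."""
    score = local_faithfulness(output, context)
    if score <= low or score >= high:   # confidently pass/fail -> no API call
        return {"score": score, "engine": "local"}
    return {"score": llm_judge(output, context), "engine": "llm"}

# stub judge standing in for a real model call
judge = lambda o, c: 0.9
print(evaluate_hybrid("the sky is blue", "the sky is blue today", judge))
```

The point of the band is cost: clear-cut cases never leave the process, and only ambiguous mid-range scores pay for an LLM call.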
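Statistical threshold calibration from the feedback loop can be sketched as choosing the pass/fail cutoff that best agrees with human verdicts. This is a minimal illustration; the SDK's actual statistics may differ:

```python
def calibrate_threshold(scores, human_pass):
    """Pick the cutoff that maximizes agreement with human pass/fail labels."""
    candidates = sorted(set(scores)) + [1.1]   # cut at each observed score, plus "fail all"
    best_t, best_acc = 0.5, -1.0
    for t in candidates:
        acc = sum((s >= t) == h for s, h in zip(scores, human_pass)) / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

scores     = [0.2, 0.4, 0.55, 0.7, 0.9]
human_pass = [False, False, True, True, True]
print(calibrate_threshold(scores, human_pass))  # 0.55 separates the labels perfectly
```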
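Streaming evaluation amounts to re-scoring the accumulated text as each chunk arrives. A dependency-free sketch with a toy scorer (the hedge-word penalty is an invented stand-in for a real metric):

```python
from typing import Callable, Iterable

def stream_scores(chunks: Iterable[str], score: Callable[[str], float]):
    """Re-score the accumulated text after every chunk, yielding live scores."""
    text = ""
    for chunk in chunks:
        text += chunk
        yield text, score(text)

# stand-in scorer: penalize hedging words as they appear
hedges = {"maybe", "probably", "possibly"}
scorer = lambda t: 1.0 - min(1.0, 0.5 * sum(w in hedges for w in t.lower().split()))

for text, s in stream_scores(["Paris is ", "probably ", "the capital."], scorer):
    print(f"{s:.1f}  {text!r}")
```

Because scores arrive per chunk, a caller can abort generation early once the live score drops below a guardrail threshold.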
Changed
- Package manager migrated from Poetry to uv
- Evaluator class is now legacy — evaluate() is the recommended API
Infrastructure
- Distributed backends: Celery, Ray, Temporal, Kubernetes
- ChromaDB-backed feedback store with semantic vector search
- Docker compose files for all backends
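What the ChromaDB-backed feedback store provides — save human corrections, then retrieve the most similar ones to inject as few-shot examples — can be sketched without the dependency. The bag-of-words cosine below is a toy stand-in for real embeddings, and the class names are hypothetical:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class FeedbackStore:
    """Minimal in-memory stand-in for the vector-backed feedback store."""
    def __init__(self):
        self.items = []   # (embedding, correction)

    def add(self, text: str, correction: str):
        self.items.append((embed(text), correction))

    def nearest(self, query: str, k: int = 2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [c for _, c in ranked[:k]]

store = FeedbackStore()
store.add("the model hallucinated a citation", "fail: fabricated reference")
store.add("answer ignored the provided context", "fail: not grounded")
store.add("response was polite and accurate", "pass")
print(store.nearest("it invented a fake citation", k=1))
```

In the real store, ChromaDB's persistence and ANN index replace the linear scan, but the retrieve-similar-corrections shape is the same.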