
v1.0.0


@NVJKKartik released this 28 Feb 07:25 · b7fa3f9

Unified evaluate() API — one function for all evaluation needs: local metrics, cloud templates, and LLM-as-judge through a single entrypoint.

```python
from fi.evals import evaluate

result = evaluate("faithfulness", output="...", context="...")
result = evaluate("faithfulness", output="...", context="...", model="gemini/gemini-3.0-flash", augment=True)
result = evaluate(prompt="Rate empathy...", output="...", engine="llm", model="gemini/gemini-3.0-flash")
```

Added

  • 72 built-in metrics — faithfulness, hallucination detection (DeBERTa NLI), groundedness, context recall/precision, answer relevancy, function call accuracy, agent trajectory, code security (SQL injection, XSS, path traversal, command injection, sensitive data exposure), PII detection, prompt injection, toxicity, string matching, JSON/API schema validation, and more
  • 3 evaluation engines — local (< 5ms, zero API calls), turing (cloud-hosted models), llm (any model via LiteLLM)
  • augment=True — hybrid evaluation: fast local heuristic first, LLM refines for edge cases
  • Multimodal LLM judge — pass image_url or audio_url to evaluate images/audio with Gemini, GPT-4V, or Claude
  • generate_prompt=True — auto-generate grading criteria from a short description
  • Feedback loop — submit human corrections, calibrate pass/fail thresholds statistically, inject corrections as few-shot examples
  • Streaming evaluation — score responses in real-time as they're generated
  • AutoEval — automatic metric recommendation based on your use case (RAG, chatbot, agent)
  • OpenTelemetry integration — gen_ai.evaluation.* semantic conventions, auto-enrichment for existing traces
  • 9 cookbooks — local metrics, LLM judge, RAG pipelines, guardrails, streaming, autoeval, OTEL, feedback loop, multimodal
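The augment=True flow described above can be sketched as a generic two-stage pattern: a cheap local heuristic scores first, and only low-confidence "gray zone" cases are escalated to an LLM judge. The function names, heuristic, and thresholds below are illustrative stand-ins, not the fi.evals internals:

```python
# Illustrative sketch of hybrid (heuristic-first, LLM-refined) evaluation.
# All names and thresholds here are hypothetical, not the SDK's internals.

def local_overlap_score(output: str, context: str) -> float:
    """Cheap heuristic: fraction of output tokens that appear in the context."""
    out_tokens = output.lower().split()
    ctx_tokens = set(context.lower().split())
    if not out_tokens:
        return 0.0
    return sum(t in ctx_tokens for t in out_tokens) / len(out_tokens)

def hybrid_evaluate(output, context, llm_judge, low=0.3, high=0.8):
    """Return (score, engine_used). Escalate to the LLM only in the gray zone."""
    score = local_overlap_score(output, context)
    if score <= low or score >= high:
        return score, "local"   # confident: keep the fast local verdict
    return llm_judge(output, context), "llm"  # uncertain: refine with the judge
```

A fully grounded answer scores high locally and never touches the LLM; a partially grounded one falls between the thresholds and gets the slower, more accurate judgment.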
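The streaming-evaluation idea, scoring a response incrementally as chunks arrive so a guardrail can cut generation short, can be sketched like this (the scorer and chunk source are stand-ins, not the fi.evals streaming API):

```python
# Illustrative sketch of streaming evaluation: re-score the growing text
# after each chunk. Names here are hypothetical, not the SDK's API.

def stream_scores(chunks, score_fn):
    """Yield (partial_text, score) after each chunk, so a caller can
    abort generation as soon as the score crosses a threshold."""
    text = ""
    for chunk in chunks:
        text += chunk
        yield text, score_fn(text)

# Toy scorer: flag the response once a banned word appears.
def toxicity_stub(text):
    return 1.0 if "darn" in text else 0.0

for partial, score in stream_scores(["hello ", "darn ", "world"], toxicity_stub):
    if score > 0.5:
        break  # stop consuming the stream the moment the check fails
```

The same shape works with any per-token metric: the scorer is called on every prefix, trading repeated cheap evaluations for the ability to stop a bad response mid-generation.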

Changed

  • Package manager migrated from Poetry to uv
  • Evaluator class is now legacy — evaluate() is the recommended API

Infrastructure

  • Distributed backends: Celery, Ray, Temporal, Kubernetes
  • ChromaDB-backed feedback store with semantic vector search
  • Docker compose files for all backends