RFC : Agentic AI - Observability, Analytics and Eval Platform : Requirements Document #2588
Description
Requirements Document
High Level Design : #2592
Goal and Vision
Goal
Build an open-source LLM evaluation platform natively on the OpenSearch ecosystem that gives teams a single place to trace, evaluate, and improve their LLM applications without introducing external databases, custom worker services, or proprietary infrastructure.
The platform uses OpenSearch indices for all storage, OTel Collector for OTLP telemetry ingestion, OpenSearch Job Scheduler for async processing, and OpenSearch Dashboards plugins for the UI. New Python and TypeScript instrumentation libraries provide the developer-facing SDK.
Why Evaluation Matters for LLM and Agent Development
Evaluation closes the gap between "it seems to work" and "we have evidence it works." Specifically:
- Catching regressions before deployment. A prompt tweak that improves one use case can silently degrade another. Running experiments against curated eval sets before shipping is the only reliable way to detect this.
- Understanding agent behavior at scale. Agentic flows involve branching tool calls, retries, and multi-step reasoning. Without trace-level observability and scoring, teams cannot identify which steps fail, which tools are misused, or where latency accumulates.
- Quantifying quality continuously. Production traffic surfaces edge cases that no test set anticipates. Online evaluation (LLM-as-a-Judge, deterministic checks, user feedback) scores live traces automatically, turning production into a continuous quality signal.
- Building ground truth iteratively. The best eval sets grow from production data. Teams identify failing traces, add them to eval sets with corrected expected outputs, and use those sets to validate future changes. This feedback loop is the core of LLM application improvement.
Without evaluation infrastructure, teams rely on manual spot-checking, anecdotal user reports, and gut feel, none of which scales as applications move from prototype to production.
Why OpenSearch
OpenSearch is already the observability backbone for thousands of organizations. Many teams running LLM applications already send logs, metrics, and traces to OpenSearch. Adding LLM evaluation as a native capability means:
- No new infrastructure to operate. Teams already managing OpenSearch clusters do not need to provision separate databases (Postgres, ClickHouse), deploy additional web servers, or run custom worker processes. Evaluation data lives in the same cluster as the rest of their observability data.
- Unified query surface. Traces, scores, experiments, and analytics are all queryable via standard OpenSearch APIs. Teams can correlate LLM evaluation results with existing application logs, infrastructure metrics, and APM data in a single query layer.
- Existing access control and governance. OpenSearch's security model (index-level permissions, field-level security, audit logging) applies to evaluation data without additional configuration. Organizations with compliance requirements do not need to evaluate a new system's security posture.
- Scale characteristics that match the workload. OpenSearch handles high-throughput document ingestion (trace and score volume), full-text and structured search (trace browsing and filtering), and time-series aggregations (score analytics and dashboards): exactly the workloads that LLM evaluation demands.
- OTLP ingestion via OTel Collector. Applications instrument once with OpenTelemetry and send traces through OTel Collector. The same pipeline that ingests APM traces can ingest LLM evaluation telemetry with additional processors for schema mapping.
For OpenSearch customers, this means LLM evaluation becomes a natural extension of their existing investment, not a separate platform with its own operational burden, learning curve, and cost structure.
Vision
The long-term vision is a complete evaluation lifecycle within OpenSearch:
- Instrument: Developers add a few lines of SDK code. Traces flow through OTLP into OTel Collector and land in OpenSearch indices. No new databases to provision. RFC : Agentic AI Eval Platform : opensearch-genai-sdk-py #2591
- Evaluate: Teams configure LLM-as-a-Judge evaluators, deterministic checks, RAG metrics, and human annotation queues, all from OpenSearch Dashboards. Online evaluation scores production traces automatically. Offline evaluation runs experiments against curated eval sets.
- Analyze: Score analytics, experiment comparison, and custom dashboards surface quality trends, regressions, and cost/latency breakdowns. Everything is queryable via standard OpenSearch APIs.
- Iterate: Production traces feed back into eval sets. Experiments run in CI/CD via the SDK. The evaluation loop tightens with each deployment.
Design Principles
- OpenSearch-native: Every component (storage, compute, scheduling, UI) uses OpenSearch ecosystem primitives. No external databases, no custom worker processes, no separate web servers.
- OTLP-first: Telemetry ingestion uses the OpenTelemetry Protocol via OTel Collector. Applications instrument once and can send data to any OTLP-compatible backend.
- Evaluation-mode aware: The platform distinguishes online evaluation (reference-free, real-time, on live traces) from offline evaluation (reference-based, batch, against eval sets with ground truth). This distinction is enforced at the template, scheduling, and UI levels.
- SDK-driven workflow: The Python and TypeScript libraries are the primary interface for developers. They handle tracing, eval set management, experiment execution, and score submission, all over standard HTTP/gRPC protocols.
- Pluggable evaluators: LLM-as-a-Judge, deterministic checks, RAG metrics (Ragas), and human annotation are all first-class evaluation methods. Teams mix and match based on their needs.
- Progressive complexity: A team can start with just tracing (Req 1-2), add automated scoring (Req 8), then grow into full experiment workflows (Req 4-7) and analytics (Req 11-12) as their evaluation practice matures.
Introduction
This document specifies the requirements for adding an LLM evaluation platform to the OpenSearch ecosystem. The goal is to deliver a fully functional LLM evaluation platform covering tracing, eval sets, experiments, scoring, annotation, LLM-as-a-Judge, and analytics, built natively on OpenSearch infrastructure.
The frontend UI is built with OpenSearch Dashboards plugins, telemetry and metadata are stored in OpenSearch indices, HTTP/OTLP ingestion is handled by OTel Collector OTLP pipelines, and the instrumentation SDKs are new Python (P0) and TypeScript (P1) instrumentation libraries.
Glossary
- Agentic_AI_Eval_Platform: The overall OpenSearch-based LLM evaluation system
- OSD_Plugin: An OpenSearch Dashboards plugin providing the user interface
- OTel_Collector: The OpenSearch telemetry ingestion pipeline that receives OTLP data
- Trace: A top-level record representing a single end-to-end execution of an LLM application
- Observation: A child record within a Trace representing an individual operation (span, generation, tool call, etc.)
- Score: An evaluation result (numeric, categorical, boolean) attached to a Trace, Observation, Session, or Experiment_Run
- Score_Config: A schema definition that standardizes how a Score is structured (name, data type, value range, categories)
- Eval_Set: A versioned collection of input/expected-output pairs used for offline evaluation
- Experiment: An individual test case within an Eval_Set
- Experiment_Run: A single execution of an application against an Eval_Set
- Experiment_Run_Item: An individual result linking an Experiment to the Trace produced during an Experiment_Run
- Annotation_Queue: A queue of Traces, Observations, or Sessions assigned to human reviewers for manual evaluation
- Annotation_Task: A single item within an Annotation_Queue awaiting human review
- LLM_Judge: An automated evaluator that uses an LLM to score Traces or Observations
- Evaluator_Template: A reusable configuration for an LLM_Judge including prompt template, model selection, output schema, and evaluation mode (ONLINE or OFFLINE)
- Deterministic_Evaluator: A built-in programmatic evaluator that scores Traces or Observations using deterministic logic (e.g., exact match, regex, JSON validity, cosine similarity) without requiring an LLM call
- RAG_Context: The set of retrieved documents or passages associated with a retrieval-type Observation within a Trace, used as input for RAG evaluation metrics
- Ground_Truth: The expected or reference output for a given input, available in Eval_Set Experiments as expectedOutput. Ground_Truth enables reference-based evaluation (e.g., comparing actual output to expected output). Live production traces do not have Ground_Truth.
- Session: A grouping of related Traces (e.g., a multi-turn conversation)
- Instrumentation_Library: A Python or TypeScript SDK that instruments LLM applications and emits OTLP telemetry
- OTLP: OpenTelemetry Protocol, the wire format used for telemetry ingestion
- Job_Scheduler: The OpenSearch Job Scheduler plugin used to schedule and execute asynchronous jobs (LLM-as-a-Judge evaluations, eval set operations) with built-in distributed execution support
- Index: An OpenSearch index used to store documents
- Dashboard: A configurable visualization surface within OSD_Plugin for monitoring and analytics
- Online_Evaluation: Real-time evaluation of live production traces as they are ingested, triggered automatically by trace matching rules (e.g., LLM-as-a-Judge on new traces, guardrail checks). Scores are computed platform-side.
- Offline_Evaluation: On-demand batch evaluation of an application against a curated Eval_Set, triggered manually by a developer, CI pipeline, or UI action. Scores are computed platform-side.
- Local_Evaluation: Evaluation computed client-side by the user's SDK (Strands, DeepEval, Ragas, custom code) and submitted as pre-computed scores via the Scores API. The platform is a passive receiver.
Evaluation Modes: Online, Offline, and Local
The Agentic_AI_Eval_Platform recognizes three distinct evaluation modes. The key distinction is who computes the score and who orchestrates execution.
| Aspect | Online (platform, live traces) | Offline (platform-orchestrated) | Local (SDK-side) |
|---|---|---|---|
| Who computes scores | Platform (Job Scheduler) | Platform (Job Scheduler) | User's SDK (Strands, DeepEval, Ragas, custom) |
| Who orchestrates execution | No one; the platform reacts to ingested traces | Platform SDK | User's code |
| Trigger | Automatic on trace ingestion | Manual or CI-triggered (experiment_id) | User-initiated in application code |
| Data source | Live production traces | Curated Eval_Set test cases | Anything; the user controls the input |
| Ground_Truth | Not available | Available via Experiment expectedOutput | User-managed (may or may not exist) |
| Evaluation strategy | Reference-free only (faithfulness, relevance, toxicity, format compliance) | Reference-based and reference-free | Anything the SDK supports |
| Latency target | Within 60 seconds of ingestion | Batch-oriented, no real-time constraint | Determined by user's pipeline |
| Telemetry flow | App → OTLP → OTel Collector → OpenSearch; platform runs the LLM Judge and stores Scores | App → OTLP (with experiment metadata) → OTel Collector → OpenSearch; platform scores results against the Eval_Set | SDK → Scores API → OpenSearch |
| Score source | EVAL_ONLINE | EVAL_OFFLINE | SDK |
| Platform role | Active: computes and stores | Active: orchestrates, computes, and stores | Passive: stores and visualizes only |
Online Evaluation
The platform scores traces after ingestion. LLM-as-a-Judge evaluators, deterministic checks, and RAG metrics fire automatically when new traces match configured filters. The user's application is unaware evaluation is happening. Scores are computed server-side by the Job_Scheduler.
Online evaluators can only assess output quality using reference-free criteria (e.g., "is this response faithful to the retrieved context?") because live production traces do not have Ground_Truth.
Each Evaluator_Template declares an evaluationMode (ONLINE or OFFLINE) so the system can enforce that reference-based templates are not assigned to online triggers, and the UI can filter available templates by mode.
Template variables available: {{input}}, {{output}}, {{context}}
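As an illustration, an ONLINE template and its variable substitution might look like the sketch below. The field names, the placeholder model id, and the `render` helper are hypothetical, not the platform's final schema:

```python
# Hypothetical ONLINE Evaluator_Template; field names are illustrative only.
faithfulness_judge = {
    "name": "faithfulness-check",
    "evaluationMode": "ONLINE",          # reference-free: no {{expectedOutput}}
    "provider": "bedrock",               # assumed pluggable LLM provider
    "model": "example-judge-model",      # placeholder model id
    "promptTemplate": (
        "Is the response faithful to the retrieved context?\n"
        "Context: {{context}}\nInput: {{input}}\nOutput: {{output}}"
    ),
}

ONLINE_VARIABLES = {"input", "output", "context"}

def render(template: str, variables: dict) -> str:
    """Substitute {{var}} placeholders, rejecting variables not allowed online."""
    illegal = set(variables) - ONLINE_VARIABLES
    if illegal:
        raise ValueError(f"Not available in ONLINE mode: {illegal}")
    for key, value in variables.items():
        template = template.replace("{{" + key + "}}", value)
    return template

prompt = render(
    faithfulness_judge["promptTemplate"],
    {"input": "What is OTLP?", "output": "A telemetry protocol.",
     "context": "OTLP is the OpenTelemetry Protocol."},
)
```

Rejecting unknown variables at render time is one way the mode distinction could be enforced: a reference-based template that mentions {{expectedOutput}} would fail before any online trigger fires.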
Offline Evaluation (Platform-orchestrated)
The platform's experiment runner (SDK or UI) drives execution against an Eval_Set. The platform calls the user's application function per test case, captures traces, and scores results. Ground_Truth is available from the Eval_Set's expectedOutput field, enabling reference-based evaluation (exact match, semantic similarity) in addition to reference-free methods.
Template variables available: {{input}}, {{output}}, {{expectedOutput}}, {{context}}
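A minimal reference-based check of the kind this mode enables, assuming expectedOutput comes from the Eval_Set (the function name is illustrative):

```python
def exact_match(output: str, expected_output: str) -> float:
    """Reference-based score: 1.0 if the normalized output equals expectedOutput."""
    return 1.0 if output.strip().lower() == expected_output.strip().lower() else 0.0

# Only possible offline: live production traces carry no expectedOutput.
exact_match("Paris", " paris ")   # 1.0
```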
Local Evaluation (SDK-side)
Third-party SDKs like Strands, DeepEval, and Ragas or custom user code run evaluations locally in the user's process. The SDK computes scores in-process (or calls an LLM judge locally), then sends both the trace telemetry (via OTLP) and the pre-computed scores (via the Scores API) to the platform. The platform is a passive receiver: it stores and visualizes but does not orchestrate or compute anything.
This mode is important because:
- Teams already using DeepEval's GEval, Ragas metrics, or Strands agent evaluation should not need to re-implement their evaluators as platform Evaluator_Templates. They run what they have and send results.
- Local evaluation can happen in CI/CD pipelines, notebooks, or development environments where the platform's Job_Scheduler is not involved.
- Scores arrive pre-computed with metadata about the evaluation library and metric (e.g., evaluator: deepeval/geval/joyfulness), enabling the platform to display, filter, and aggregate them alongside platform-computed scores.
For the Instrumentation_Library, this means:
- Accept pre-computed scores from third-party evaluation SDKs and forward them with source: SDK
- Include metadata about the evaluation library and metric name in the score's metadata field
- Correlate scores to traces via traceId, which the OTLP instrumentation already provides
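Putting those three responsibilities together, a locally computed result might be wrapped as a Score document like this (the document shape and helper name are assumptions, not the final Scores API):

```python
import uuid
from typing import Optional

def make_sdk_score(trace_id: str, name: str, value: float,
                   evaluator: str, observation_id: Optional[str] = None) -> dict:
    """Wrap a locally computed result as a Score document with source: SDK."""
    score = {
        "id": str(uuid.uuid4()),
        "traceId": trace_id,                   # correlates to the OTLP trace
        "name": name,
        "value": value,
        "source": "SDK",
        "metadata": {"evaluator": evaluator},  # library/metric lineage
    }
    if observation_id is not None:
        score["observationId"] = observation_id
    return score

score = make_sdk_score("trace-123", "joyfulness", 0.82,
                       evaluator="deepeval/geval/joyfulness")
```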
Score Source Field
All three modes produce Score documents stored in the same index, but with different source values to distinguish lineage:
| Source | Description |
|---|---|
| EVAL_ONLINE | Computed by the platform's online evaluation pipeline (Job Scheduler + LLM Judge / deterministic / RAG) |
| EVAL_OFFLINE | Computed by the platform's offline experiment runner |
| SDK | Computed locally by the instrumentation library or a third-party SDK (Strands, DeepEval, Ragas, etc.) and submitted via Scores API |
| ANNOTATION | Human review via annotation queues |
| API | Generic programmatic submission (user feedback, guardrails, custom pipelines) |
Online scores reference live traceIds directly. Offline scores reference experimentRunIds and are linked through Experiment_Run_Items. SDK scores reference traceIds and optionally observationIds, with evaluator metadata in the score's metadata field.
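The linkage rules can be summarized with three minimal Score documents; the source values follow the table above, but the remaining field names are illustrative:

```python
# Only the linkage fields differ across modes (other fields omitted for brevity).
online_score = {"source": "EVAL_ONLINE", "traceId": "t-1",
                "name": "faithfulness", "value": 0.9}
offline_score = {"source": "EVAL_OFFLINE", "experimentRunId": "run-7",
                 "name": "exact_match", "value": 1.0}
sdk_score = {"source": "SDK", "traceId": "t-2", "observationId": "o-5",
             "name": "answer_relevancy", "value": 0.7,
             "metadata": {"evaluator": "ragas/answer_relevancy"}}
```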
Evaluation Algorithm Dependencies
The Agentic_AI_Eval_Platform does not implement evaluation algorithms from scratch. For both Online and Offline evaluation modes where the platform computes scores server-side, the platform delegates to open-source evaluation libraries for the actual scoring logic:
- Strands Eval: agent trajectory evaluation, tool-use correctness, multi-step reasoning assessment
- DeepEval: GEval custom metrics, hallucination detection, answer relevancy, faithfulness, and other LLM-as-a-Judge patterns
- Ragas: RAG-specific metrics, including context precision, context recall, answer faithfulness, and answer relevancy
This means:
- Evaluator_Templates are thin wrappers. An LLM-as-a-Judge template specifies which library, which metric, which LLM provider, and which parameters to use. The platform constructs the library call, not the raw LLM prompt.
- New evaluation methods ship as library updates. When DeepEval adds a new metric or Ragas improves faithfulness scoring, the platform picks it up by upgrading the dependency; no platform code changes are required.
- Custom evaluators are still supported. Teams can write their own evaluation functions in the SDK (Local mode) or configure deterministic evaluators in the UI. The library dependency applies only to the platform's built-in LLM-based and RAG evaluation methods.
- LLM provider configuration is pluggable. Each Evaluator_Template declares a provider and model. The evaluation library receives this configuration and handles the LLM call. Switching from Bedrock to OpenAI is a template-level change, not a platform change.
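A sketch of the thin-wrapper dispatch described above, with the library calls stubbed out (a real implementation would import deepeval, ragas, or strands; all names here are assumptions):

```python
# Registry mapping (library, metric) to a scoring callable. The lambdas stand
# in for real library calls such as a DeepEval metric or a Ragas scorer.
REGISTRY = {
    ("deepeval", "faithfulness"): lambda trace: 0.9,    # stubbed library call
    ("ragas", "context_precision"): lambda trace: 0.8,  # stubbed library call
}

def run_evaluator(template: dict, trace: dict) -> dict:
    """Resolve the template to a library call and return a Score document."""
    key = (template["library"], template["metric"])
    if key not in REGISTRY:
        raise ValueError(f"No evaluator registered for {key}")
    mode = template["evaluationMode"]
    return {
        "name": template["metric"],
        "value": REGISTRY[key](trace),
        "source": "EVAL_ONLINE" if mode == "ONLINE" else "EVAL_OFFLINE",
    }

score = run_evaluator(
    {"library": "deepeval", "metric": "faithfulness", "evaluationMode": "ONLINE"},
    trace={"input": "q", "output": "a"},
)
```

The template never contains scoring logic itself, which is what makes "new evaluation methods ship as library updates" possible.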
Job Scheduling Across Modes
The Job_Scheduler is only involved in Online and Offline modes. Local evaluation bypasses the Job_Scheduler entirely; scores arrive via the Scores API as already-computed documents.
For Online and Offline modes, the Job_Scheduler must support both: low-latency single-item jobs for online evaluation and high-throughput batch jobs for offline Experiment_Runs. Queue separation or priority levels ensure online evaluations are not starved by large offline batch runs.
Execution Model
The Agentic_AI_Eval_Platform uses two distinct execution patterns depending on the evaluation context:
Post-run scoring (passive): The eval framework only reads already-captured data. Traces and observations are ingested first, then evaluators score them after the fact. The LLM_Judge, annotation queues, and custom scores all operate on stored trace data; the framework never invokes the user's application. This applies to:
- Online evaluation via LLM_Judge (Req 8)
- Human annotation (Req 9)
- Custom programmatic scores (Req 10)
Experiment execution (active): The offline experiment flow orchestrates application execution. The Instrumentation_Library (Req 6) or OSD_Plugin (Req 7) calls the user's application function with each Experiment's input, captures the resulting trace, and optionally scores the result. The eval framework treats the user's application as a black-box function: it provides input, receives output, and records the trace. It does not manage agent lifecycles, LLM provider connections, or application-internal orchestration.
| Component | Role | Calls user's app? |
|---|---|---|
| Instrumentation_Library (SDK experiment runner) | Orchestrates offline experiment execution | Yes : invokes user-provided function per Experiment |
| OSD_Plugin (UI experiment runner) | Executes LLM directly against Eval_Set | No : calls configured LLM directly, not user's app |
| Job_Scheduler + LLM_Judge | Scores existing traces post-ingestion | No : reads stored trace data only |
| Annotation_Queue | Routes traces to human reviewers | No : presents stored trace data only |
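The active experiment pattern reduces to a loop like the following; the function and field names are illustrative, not the SDK's final API:

```python
def run_experiment(eval_set, app_fn, scorers):
    """Offline run: treat the user's app as a black box, score against Ground_Truth."""
    results = []
    for item in eval_set:
        output = app_fn(item["input"])        # black-box call into the user's app
        scores = {name: fn(output, item["expectedOutput"])
                  for name, fn in scorers.items()}
        results.append({"input": item["input"], "output": output, "scores": scores})
    return results

results = run_experiment(
    eval_set=[{"input": "2 + 2?", "expectedOutput": "4"}],
    app_fn=lambda question: "4",              # stand-in for the real application
    scorers={"exact": lambda out, exp: 1.0 if out == exp else 0.0},
)
```

In the real flow, each `app_fn` call would also emit an OTLP trace tagged with experiment metadata, which is what links Experiment_Run_Items back to Traces.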
User Personas
| Persona | Description |
|---|---|
| Application Developer | Builds and instruments LLM applications using the Python or TypeScript SDKs. Sends telemetry, browses traces, debugs execution flows, and submits custom scores. |
| Evaluation Engineer | Designs eval sets, configures evaluators (LLM Judge, deterministic, RAG), runs experiments, and analyzes scoring results. May or may not write code. |
| Annotation Reviewer | Domain expert who manually scores traces and observations through annotation queues. Does not write code. |
| Platform Administrator | Manages OpenSearch infrastructure, index design, plugin deployment, job scheduling, and monitoring dashboards. |
User Story Summary
Detailed requirements : #2590
| # | Persona | Requirement | Focus | User Story |
|---|---|---|---|---|
| 1 | Application Developer | Trace and Observation Ingestion via OTel Collector | Instrumentation | As an Application Developer, I want to send trace and observation telemetry from my application so that the Agentic_AI_Eval_Platform captures and stores all execution data for evaluation. |
| 2 | Platform Administrator | OpenSearch Index Design for Traces and Observations | Storage | As a Platform Administrator, I want traces and observations stored in well-structured OpenSearch indices so that queries, aggregations, and analytics perform efficiently. |
| 3 | Evaluation Engineer | Score Storage and Score Config Management | Scoring | As an Evaluation Engineer, I want to create score configurations and attach scores to traces, observations, sessions, or experiment runs so that evaluation results are structured and queryable. |
| 4 | Evaluation Engineer | Eval Set Management | Eval sets | As an Evaluation Engineer, I want to create and manage eval sets of input/expected-output pairs so that I can run offline experiments against my LLM application. |
| 5 | Evaluation Engineer | Experiment Runs | Experiments | As an Evaluation Engineer, I want to run my LLM application against an eval set and record the results so that I can evaluate application performance across test cases. |
| 6 | Application Developer | Experiment Execution via SDK | SDK experiments | As an Application Developer, I want to programmatically run experiments from my Python or TypeScript code so that I can integrate evaluation into my CI/CD pipeline. |
| 7 | Evaluation Engineer | Experiment Execution via UI | UI experiments | As an Evaluation Engineer, I want to run experiments from the OpenSearch Dashboards UI so that I can evaluate prompt versions without writing code. |
| 8 | Evaluation Engineer | LLM-as-a-Judge Automated Evaluation | Automated scoring | As an Evaluation Engineer, I want to configure automated LLM-based evaluators so that traces and observations are scored without manual intervention. |
| 9 | Evaluation Engineer | Annotation Queues for Human Evaluation | Human review | As an Evaluation Engineer, I want to create annotation queues and assign reviewers so that domain experts can manually evaluate traces and observations. |
| 10 | Application Developer | Custom Scores via SDK and API | Custom scoring | As an Application Developer, I want to submit custom scores programmatically so that I can integrate user feedback, guardrail results, and custom evaluation pipelines. |
| 11 | Evaluation Engineer | Score Analytics and Comparison | Analytics | As an Evaluation Engineer, I want to analyze and compare scores across different evaluators, models, and time periods so that I can understand evaluation quality and trends. |
| 12 | Platform Administrator | Dashboards and Monitoring | Monitoring | As a Platform Administrator, I want configurable dashboards so that I can monitor LLM application performance, cost, and quality metrics. |
| 13 | Application Developer | Python Instrumentation Library | Python SDK | As an Application Developer, I want a Python instrumentation library so that I can trace LLM calls, create eval sets, run experiments, and submit scores from my Python application. |
| 14 | Application Developer | TypeScript Instrumentation Library | TypeScript SDK | As an Application Developer, I want a TypeScript instrumentation library so that I can trace LLM calls, create eval sets, run experiments, and submit scores from my TypeScript application. |
| 15 | Application Developer | Trace and Observation Browsing | Debugging | As an Application Developer, I want to browse and search traces and observations in the UI so that I can debug and understand my LLM application behavior. |
| 16 | Application Developer | Session Management | Sessions | As an Application Developer, I want to group related traces into sessions so that I can analyze multi-turn conversations and user journeys. |
| 17 | Platform Administrator | OpenSearch Dashboards Plugin Architecture | Infrastructure | As a Platform Administrator, I want the evaluation UI built as proper OpenSearch Dashboards plugins so that the platform integrates natively with the OpenSearch ecosystem. |
| 18 | Platform Administrator | Async Processing via OpenSearch Job Scheduler | Job scheduling | As a Platform Administrator, I want asynchronous tasks like LLM-as-a-Judge evaluations to execute reliably via the OpenSearch Job Scheduler so that both online and offline evaluation processing integrates natively with the OpenSearch ecosystem without a custom worker service. |
| 19 | Application Developer | Single-Trace Visualizations (Trace Map and Debug Timeline) | Debugging | As an Application Developer debugging a complex agentic flow, I want graph and timeline visualizations of a single trace so that I can understand execution structure, parallel branches, and performance bottlenecks. |
| 20 | Evaluation Engineer | Multi-Trace Analytics Visualizations (Agent Map and Agent Path) | Analytics | As an Evaluation Engineer analyzing agentic application behavior at scale, I want aggregate topology and path visualizations across multiple traces so that I can understand system architecture, common execution patterns, and failure paths. |
| 21 | Evaluation Engineer | Built-in Deterministic Evaluators | Deterministic scoring | As an Evaluation Engineer, I want to configure deterministic evaluators from the UI so that I can score traces and experiment results using programmatic criteria without writing custom code or consuming LLM tokens. |
| 22 | Evaluation Engineer | RAG Evaluation via Ragas Framework | RAG scoring | As an Evaluation Engineer building RAG applications, I want built-in RAG evaluation metrics based on the Ragas framework so that I can assess retrieval quality, answer faithfulness, and context relevance without implementing custom evaluators. |