RFC: Agentic AI Eval Platform: Detailed Requirements and User Stories #2590

@anirudha

Description

Requirements

Primary Issue: #2588

Requirement 1: Trace and Observation Ingestion via OTel Collector

User Story: As an Application Developer, I want to send trace and observation telemetry from my application so that the Agentic_AI_Eval_Platform captures and stores all execution data for evaluation.

Acceptance Criteria

  • 1.1 WHEN the Instrumentation_Library emits an OTLP trace payload, THE OTel_Collector SHALL receive the payload via gRPC or HTTP OTLP receiver and write it to the appropriate OpenSearch Index
  • 1.2 WHEN a Trace document is ingested, THE OTel_Collector SHALL map OTLP span attributes to the Trace schema fields: id, name, timestamp, environment, tags, release, version, input, output, metadata, sessionId, userId, and projectId
  • 1.3 WHEN an Observation document is ingested, THE OTel_Collector SHALL map OTLP span attributes to the Observation schema fields: id, traceId, type, startTime, endTime, name, model, input, output, usage details, cost details, and parentObservationId
  • 1.4 WHEN a Trace contains nested Observations, THE OTel_Collector SHALL preserve the parent-child hierarchy using parentObservationId references
  • 1.5 IF the OTel_Collector receives a malformed OTLP payload, THEN THE OTel_Collector SHALL reject the payload with a descriptive error and not write partial data to any Index
  • 1.6 WHEN multiple Observations reference the same traceId, THE Agentic_AI_Eval_Platform SHALL store each Observation as a separate document linked to the parent Trace by traceId
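A minimal sketch (not part of the requirements) of the attribute mapping described in 1.2. The OTLP attribute keys used here (`eval.session_id`, `eval.user_id`, `eval.project_id`, `eval.tags`, `eval.metadata.*`) are hypothetical placeholders; the actual semantic conventions would be fixed during design:

```python
def map_span_to_trace(span: dict) -> dict:
    """Map a decoded OTLP span (as a plain dict) onto the Trace schema (Req 1.2).

    Attribute key names are illustrative assumptions, not a defined convention.
    """
    attrs = span.get("attributes", {})
    return {
        "id": span["trace_id"],
        "name": span.get("name"),
        "timestamp": span.get("start_time_unix_nano"),
        "environment": attrs.get("deployment.environment"),
        "tags": attrs.get("eval.tags", []),          # hypothetical key
        "sessionId": attrs.get("eval.session_id"),   # hypothetical key
        "userId": attrs.get("eval.user_id"),         # hypothetical key
        "projectId": attrs.get("eval.project_id"),   # hypothetical key
        "input": attrs.get("eval.input"),
        "output": attrs.get("eval.output"),
        "metadata": {k: v for k, v in attrs.items()
                     if k.startswith("eval.metadata.")},
    }
```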

Requirement 2: OpenSearch Index Design for Traces and Observations

User Story: As a Platform Administrator, I want traces and observations stored in well-structured OpenSearch indices so that queries, aggregations, and analytics perform efficiently.

Acceptance Criteria

  • 2.1 THE Agentic_AI_Eval_Platform SHALL store Traces in a dedicated OpenSearch Index with mappings for: id (keyword), name (text+keyword), timestamp (date), environment (keyword), tags (keyword array), release (keyword), version (keyword), input (object), output (object), metadata (object), sessionId (keyword), userId (keyword), projectId (keyword), bookmarked (boolean), public (boolean), createdAt (date), updatedAt (date)
  • 2.2 THE Agentic_AI_Eval_Platform SHALL store Observations in a dedicated OpenSearch Index with mappings for: id (keyword), traceId (keyword), projectId (keyword), type (keyword), startTime (date), endTime (date), name (text+keyword), model (keyword), level (keyword), parentObservationId (keyword), input (object), output (object), usageDetails (object), costDetails (object), latency (float), and all other Observation schema fields
  • 2.3 WHEN a query filters Traces by projectId and timestamp range, THE Agentic_AI_Eval_Platform SHALL return results within 2 seconds for indices containing up to 10 million documents
  • 2.4 THE Agentic_AI_Eval_Platform SHALL configure Index templates with appropriate shard counts, replica settings, and refresh intervals for the expected data volume
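An illustrative subset of what the Trace index template body (2.1, 2.4) might look like, expressed as a Python dict. Shard, replica, and refresh values are placeholder assumptions, and disabling indexing on `input`/`output` objects is one possible design choice, not a requirement:

```python
# Illustrative sketch of an OpenSearch index template body for the Trace index.
TRACE_INDEX_TEMPLATE = {
    "settings": {
        "number_of_shards": 3,       # assumption: tune for expected volume
        "number_of_replicas": 1,     # assumption
        "refresh_interval": "5s",    # assumption
    },
    "mappings": {
        "properties": {
            "id": {"type": "keyword"},
            "name": {"type": "text",
                     "fields": {"keyword": {"type": "keyword"}}},
            "timestamp": {"type": "date"},
            "environment": {"type": "keyword"},
            "tags": {"type": "keyword"},
            "release": {"type": "keyword"},
            "version": {"type": "keyword"},
            "sessionId": {"type": "keyword"},
            "userId": {"type": "keyword"},
            "projectId": {"type": "keyword"},
            "bookmarked": {"type": "boolean"},
            "public": {"type": "boolean"},
            "createdAt": {"type": "date"},
            "updatedAt": {"type": "date"},
            # Design choice: store but do not index free-form payloads.
            "input": {"type": "object", "enabled": False},
            "output": {"type": "object", "enabled": False},
            "metadata": {"type": "object"},
        }
    },
}
```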

Requirement 3: Score Storage and Score Config Management

User Story: As an Evaluation Engineer, I want to create score configurations and attach scores to traces, observations, sessions, or experiment runs so that evaluation results are structured and queryable.

Acceptance Criteria

  • 3.1 WHEN a user creates a Score_Config, THE OSD_Plugin SHALL store it in an OpenSearch Index with fields: id, projectId, name, dataType (NUMERIC, CATEGORICAL, or BOOLEAN), isArchived, minValue, maxValue, categories, and description
  • 3.2 WHEN a Score is submitted via the Instrumentation_Library or OSD_Plugin, THE Agentic_AI_Eval_Platform SHALL validate the Score value against its associated Score_Config (data type, min/max range, allowed categories)
  • 3.3 IF a Score value violates its Score_Config constraints, THEN THE Agentic_AI_Eval_Platform SHALL reject the Score with a descriptive validation error
  • 3.4 THE Agentic_AI_Eval_Platform SHALL store Scores in an OpenSearch Index with fields: id, projectId, name, value, dataType, source (EVAL_ONLINE, EVAL_OFFLINE, SDK, ANNOTATION, or API), traceId, observationId, sessionId, experimentRunId, authorUserId, comment, metadata, configId, timestamp, and environment
  • 3.5 WHEN a Score is submitted with a configId, THE Agentic_AI_Eval_Platform SHALL verify the referenced Score_Config exists and is not archived before accepting the Score
  • 3.6 WHEN a Score is submitted with the same idempotency key as an existing Score, THE Agentic_AI_Eval_Platform SHALL update the existing Score rather than creating a duplicate
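The validation rules in 3.2, 3.3, and 3.5 can be pinned down with a small sketch (field names follow the Score_Config schema in 3.1; the error-string return convention is an assumption):

```python
def validate_score(value, config: dict):
    """Validate a Score value against its Score_Config (Req 3.2/3.3).

    Returns None if valid, otherwise a descriptive error string.
    """
    if config.get("isArchived"):
        return "Score_Config is archived"  # Req 3.5
    dtype = config["dataType"]
    if dtype == "NUMERIC":
        # bool is excluded explicitly because bool is a subclass of int.
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            return f"expected NUMERIC value, got {type(value).__name__}"
        lo, hi = config.get("minValue"), config.get("maxValue")
        if lo is not None and value < lo:
            return f"value {value} below minValue {lo}"
        if hi is not None and value > hi:
            return f"value {value} above maxValue {hi}"
    elif dtype == "CATEGORICAL":
        if value not in config.get("categories", []):
            return f"value {value!r} not in allowed categories"
    elif dtype == "BOOLEAN":
        if not isinstance(value, bool):
            return f"expected BOOLEAN value, got {type(value).__name__}"
    else:
        return f"unknown dataType {dtype!r}"
    return None
```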

Requirement 4: Eval Set Management

User Story: As an Evaluation Engineer, I want to create and manage eval sets of input/expected-output pairs so that I can run offline experiments against my LLM application.

Acceptance Criteria

  • 4.1 WHEN a user creates an Eval_Set, THE OSD_Plugin SHALL store it in an OpenSearch Index with fields: id, projectId, name, description, metadata, inputSchema (JSON Schema), expectedOutputSchema (JSON Schema), createdAt, and updatedAt
  • 4.2 WHEN a user adds an Experiment to an Eval_Set, THE Agentic_AI_Eval_Platform SHALL store it with fields: id, projectId, evalSetId, input, expectedOutput, metadata, sourceTraceId, sourceObservationId, status (ACTIVE or ARCHIVED), validFrom, and createdAt
  • 4.3 WHEN an Eval_Set defines an inputSchema, THE Agentic_AI_Eval_Platform SHALL validate each Experiment input against the JSON Schema before accepting it
  • 4.4 WHEN an Eval_Set defines an expectedOutputSchema, THE Agentic_AI_Eval_Platform SHALL validate each Experiment expectedOutput against the JSON Schema before accepting it
  • 4.5 IF an Experiment input or expectedOutput fails JSON Schema validation, THEN THE Agentic_AI_Eval_Platform SHALL reject the item with a descriptive validation error identifying the schema violation
  • 4.6 WHEN a user updates an Experiment, THE Agentic_AI_Eval_Platform SHALL create a new version by setting validFrom on the new version and preserving the previous version for historical queries
  • 4.7 WHEN a user queries Experiments for an Eval_Set, THE Agentic_AI_Eval_Platform SHALL return only the latest active version of each item by default
  • 4.8 WHEN a user archives an Experiment, THE Agentic_AI_Eval_Platform SHALL set the status to ARCHIVED and exclude the item from default queries while retaining it for historical access
  • 4.9 THE Agentic_AI_Eval_Platform SHALL enforce unique Eval_Set names within a projectId
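The default-view semantics of 4.6–4.8 (latest version per item, archived items excluded) can be sketched as a pure resolution function over stored versions:

```python
def latest_active_items(versions: list) -> list:
    """Resolve the default view of an Eval_Set (Req 4.6-4.8): for each item id,
    keep only the version with the greatest validFrom, then drop ARCHIVED items."""
    latest = {}
    for v in versions:
        cur = latest.get(v["id"])
        if cur is None or v["validFrom"] > cur["validFrom"]:
            latest[v["id"]] = v
    return [v for v in latest.values() if v.get("status") != "ARCHIVED"]
```

In an OpenSearch-backed implementation the same semantics would likely be expressed as a `collapse` or top-hits aggregation on `id` sorted by `validFrom`, rather than client-side resolution.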

Requirement 5: Experiment Runs

User Story: As an Evaluation Engineer, I want to run my LLM application against an eval set and record the results so that I can evaluate application performance across test cases.

Acceptance Criteria

  • 5.1 WHEN a user initiates an Experiment_Run via the Instrumentation_Library, THE Agentic_AI_Eval_Platform SHALL create an Experiment_Run document with fields: id, projectId, evalSetId, name, description, metadata, and createdAt
  • 5.2 WHEN the application processes an Experiment during an Experiment_Run, THE Agentic_AI_Eval_Platform SHALL create an Experiment_Run_Item linking the Experiment to the resulting Trace via fields: id, projectId, experimentRunId, experimentId, traceId, observationId, and error
  • 5.3 WHEN an Experiment_Run completes, THE OSD_Plugin SHALL display a summary showing total items processed, items with errors, and aggregate scores
  • 5.4 WHEN a user views an Experiment_Run, THE OSD_Plugin SHALL display side-by-side comparison of Experiment expected output and actual Trace output for each Experiment_Run_Item
  • 5.5 THE Agentic_AI_Eval_Platform SHALL enforce unique Experiment_Run names within an evalSetId and projectId combination
  • 5.6 WHEN an Experiment_Run_Item encounters an error during processing, THE Agentic_AI_Eval_Platform SHALL record the error message in the Experiment_Run_Item error field and continue processing remaining items

Requirement 6: Experiment Execution via SDK

User Story: As an Application Developer, I want to programmatically run experiments from my Python or TypeScript code so that I can integrate evaluation into my CI/CD pipeline.

Acceptance Criteria

  • 6.1 WHEN a developer calls the experiment runner function in the Instrumentation_Library, THE Instrumentation_Library SHALL fetch the Experiments, execute the provided application function for each item concurrently, and record each result as an Experiment_Run_Item
  • 6.2 WHEN the experiment runner executes an application function for an Experiment, THE Instrumentation_Library SHALL automatically create a Trace for the execution and link it to the Experiment_Run_Item
  • 6.3 WHEN an item-level evaluator is provided, THE Instrumentation_Library SHALL execute the evaluator for each Experiment_Run_Item and submit the resulting Scores
  • 6.4 WHEN a run-level evaluator is provided, THE Instrumentation_Library SHALL execute the evaluator after all items are processed and submit the resulting Score attached to the Experiment_Run
  • 6.5 IF an application function throws an error for a single Experiment, THEN THE Instrumentation_Library SHALL record the error on that Experiment_Run_Item and continue processing remaining items without aborting the run
  • 6.6 WHEN the experiment completes, THE Instrumentation_Library SHALL return a summary object containing the Experiment_Run id, total items, successful items, failed items, and aggregate scores
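The runner loop implied by 6.1, 6.3, 6.5, and 6.6 can be sketched as follows. Trace creation, storage, and run-level evaluators are omitted; the function shape and field names are assumptions, not the SDK's actual API:

```python
import concurrent.futures

def run_experiment(items, app_fn, item_evaluator=None, max_workers=4):
    """Sketch of the Req 6 runner loop: execute app_fn per Experiment
    concurrently (6.1), isolate per-item failures (6.5), score successful
    items (6.3), and return a summary (6.6)."""

    def process(item):
        try:
            output = app_fn(item["input"])
            return {"experimentId": item["id"], "output": output, "error": None}
        except Exception as exc:  # 6.5: record the error, keep going
            return {"experimentId": item["id"], "output": None, "error": str(exc)}

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        run_items = list(pool.map(process, items))

    scores = []
    if item_evaluator:
        for item, result in zip(items, run_items):
            if result["error"] is None:
                scores.append(item_evaluator(item, result["output"]))

    failed = sum(1 for r in run_items if r["error"])
    return {
        "totalItems": len(run_items),
        "successfulItems": len(run_items) - failed,
        "failedItems": failed,
        "scores": scores,
    }
```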

Requirement 7: Experiment Execution via UI

User Story: As an Evaluation Engineer, I want to run experiments from the OpenSearch Dashboards UI so that I can evaluate prompt versions without writing code.

Acceptance Criteria

  • 7.1 WHEN a user selects an Eval_Set and an LLM configuration in the OSD_Plugin, THE OSD_Plugin SHALL execute the LLM against each Experiment and record results as an Experiment_Run
  • 7.2 WHEN the UI experiment completes, THE OSD_Plugin SHALL display results in a table with columns for: Experiment input, expected output, actual output, and any attached Scores
  • 7.3 WHEN multiple Experiment_Runs exist for the same Eval_Set, THE OSD_Plugin SHALL provide a side-by-side comparison view showing outputs and scores across runs
  • 7.4 WHEN a user configures an LLM_Judge for a UI experiment, THE OSD_Plugin SHALL automatically score each Experiment_Run_Item using the configured Evaluator_Template

Requirement 8: LLM-as-a-Judge Automated Evaluation

User Story: As an Evaluation Engineer, I want to configure automated LLM-based evaluators so that traces and observations are scored without manual intervention.

Acceptance Criteria

  • 8.1 WHEN a user creates an Evaluator_Template, THE OSD_Plugin SHALL store it with fields: id, projectId, name, prompt template, model configuration (provider, model name, temperature, max tokens), output schema (score name, data type, value mapping), target type (Trace or Observation), and evaluationMode (ONLINE or OFFLINE)
  • 8.2 THE Agentic_AI_Eval_Platform SHALL support LLM providers for LLM_Judge evaluation: Amazon Bedrock, OpenAI, and Anthropic, with model configuration specifying the provider, model identifier, and inference parameters
  • 8.3 WHEN an LLM_Judge evaluation is triggered for a Trace or Observation, THE Job_Scheduler SHALL construct the LLM prompt from the Evaluator_Template, call the configured LLM, parse the response using structured output (JSON mode or function calling where supported), and submit the resulting Score
  • 8.4 WHEN an LLM_Judge evaluation completes, THE Job_Scheduler SHALL create a separate execution Trace capturing the LLM call details (prompt, response, latency, cost) for debugging the evaluator
  • 8.5 IF the LLM response does not conform to the expected output schema, THEN THE Job_Scheduler SHALL record the evaluation as failed with a descriptive error and not submit a Score
  • 8.6 WHEN a user configures an LLM_Judge to run on new Traces matching a filter, THE Job_Scheduler SHALL process matching Traces asynchronously within 60 seconds of ingestion
  • 8.7 WHEN a user views an LLM_Judge Score, THE OSD_Plugin SHALL provide a link to the execution Trace so the user can debug the evaluator reasoning
  • 8.8 WHEN an Evaluator_Template has evaluationMode ONLINE, THE Agentic_AI_Eval_Platform SHALL only expose template variables {{input}}, {{output}}, and {{context}} in the prompt template and SHALL NOT allow {{expectedOutput}}
  • 8.9 WHEN an Evaluator_Template has evaluationMode OFFLINE, THE Agentic_AI_Eval_Platform SHALL expose template variables {{input}}, {{output}}, {{expectedOutput}}, and {{context}} in the prompt template
  • 8.10 IF a user attempts to assign an OFFLINE Evaluator_Template to an online trace-matching trigger, THEN THE Agentic_AI_Eval_Platform SHALL reject the configuration with a descriptive error explaining that reference-based evaluators require Ground_Truth from an Eval_Set
  • 8.11 THE Agentic_AI_Eval_Platform SHALL support multi-criteria evaluation where a single Evaluator_Template produces multiple Scores from one LLM call, each mapped to a separate score name and data type in the output schema
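The structured-output parsing step in 8.3, 8.5, and the multi-criteria mapping in 8.11 can be sketched as a validator over the raw LLM response. The output-schema shape (`scoreName`/`dataType` entries) is an assumption for illustration:

```python
import json

def parse_judge_response(raw: str, output_schema: list):
    """Parse an LLM judge's JSON-mode response against the Evaluator_Template
    output schema (Req 8.5, 8.11). Returns (scores, error): on any schema
    violation no Score is produced and a descriptive error is returned."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"response is not valid JSON: {exc}"  # 8.5
    if not isinstance(payload, dict):
        return None, "response must be a JSON object"
    scores = []
    for spec in output_schema:  # 8.11: one LLM call, multiple Scores
        name = spec["scoreName"]
        if name not in payload:
            return None, f"missing expected score field {name!r}"
        value = payload[name]
        if spec["dataType"] == "NUMERIC" and not isinstance(value, (int, float)):
            return None, f"field {name!r} must be numeric"
        scores.append({"name": name, "value": value, "dataType": spec["dataType"]})
    return scores, None
```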

Requirement 9: Annotation Queues for Human Evaluation

User Story: As an Evaluation Engineer, I want to create annotation queues and assign reviewers so that domain experts can manually evaluate traces and observations.

Acceptance Criteria

  • 9.1 WHEN a user creates an Annotation_Queue, THE OSD_Plugin SHALL store it with fields: id, projectId, name, description, scoreConfigIds (list of Score_Configs reviewers must fill), createdAt, and updatedAt
  • 9.2 WHEN a user assigns reviewers to an Annotation_Queue, THE Agentic_AI_Eval_Platform SHALL create assignment records linking each user to the queue
  • 9.3 WHEN a user bulk-adds Traces, Observations, or Sessions to an Annotation_Queue, THE Agentic_AI_Eval_Platform SHALL create an Annotation_Task for each item with status PENDING
  • 9.4 WHEN a reviewer opens an Annotation_Queue, THE OSD_Plugin SHALL present the next PENDING Annotation_Task and lock it to prevent concurrent review by another user
  • 9.5 WHEN a reviewer submits scores for an Annotation_Task, THE Agentic_AI_Eval_Platform SHALL create Score documents with source ANNOTATION and authorUserId, and set the Annotation_Task status to COMPLETED
  • 9.6 WHEN a reviewer submits scores, THE Agentic_AI_Eval_Platform SHALL validate each Score against the Score_Configs specified in the Annotation_Queue
  • 9.7 IF a reviewer abandons a locked Annotation_Task without submitting, THEN THE Agentic_AI_Eval_Platform SHALL release the lock after a configurable timeout and return the task to PENDING status

Requirement 10: Custom Scores via SDK and API

User Story: As an Application Developer, I want to submit custom scores programmatically so that I can integrate user feedback, guardrail results, and custom evaluation pipelines.

Acceptance Criteria

  • 10.1 WHEN the Instrumentation_Library submits a Score, THE Agentic_AI_Eval_Platform SHALL accept it with fields: name, value, dataType, traceId or observationId or sessionId or experimentRunId, comment, metadata, and configId
  • 10.2 WHEN a Score is submitted without a configId, THE Agentic_AI_Eval_Platform SHALL accept the Score using the provided dataType and value without config validation
  • 10.3 WHEN a Score is submitted with a configId, THE Agentic_AI_Eval_Platform SHALL validate the value against the Score_Config constraints before accepting
  • 10.4 THE Instrumentation_Library SHALL support submitting Scores with an idempotency key so that retried submissions update rather than duplicate
  • 10.5 WHEN multiple Scores with different names are submitted for the same Trace, THE Agentic_AI_Eval_Platform SHALL store each Score independently and make all queryable
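The idempotency semantics shared by 3.6 and 10.4 reduce to upsert-by-key, sketched here with an in-memory store (a real implementation would use the idempotency key as, or derive it into, the OpenSearch document `_id`):

```python
class ScoreStore:
    """In-memory sketch of idempotent Score submission (Req 3.6 / 10.4):
    resubmitting with the same idempotency key updates the stored Score
    instead of inserting a duplicate."""

    def __init__(self):
        self._by_key = {}

    def submit(self, idempotency_key: str, score: dict) -> str:
        existed = idempotency_key in self._by_key
        self._by_key[idempotency_key] = score
        return "updated" if existed else "created"

    def all(self):
        return list(self._by_key.values())
```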

Requirement 11: Score Analytics and Comparison

User Story: As an Evaluation Engineer, I want to analyze and compare scores across different evaluators, models, and time periods so that I can understand evaluation quality and trends.

Acceptance Criteria

  • 11.1 WHEN a user opens the score analytics Dashboard, THE OSD_Plugin SHALL display aggregate score distributions grouped by score name and source
  • 11.2 WHEN a user selects two or more score sources for comparison, THE OSD_Plugin SHALL compute and display inter-rater agreement metrics: Pearson correlation and Spearman correlation for numeric scores, Cohen's Kappa for categorical scores, and F1 score for boolean scores
  • 11.3 WHEN a user selects two categorical score sources, THE OSD_Plugin SHALL display a confusion matrix and heatmap visualization
  • 11.4 WHEN a user selects a time range, THE OSD_Plugin SHALL display score trends over time as a line chart
  • 11.5 WHEN computing agreement metrics, THE Agentic_AI_Eval_Platform SHALL only include Traces or Observations that have Scores from all selected sources
  • 11.6 THE Agentic_AI_Eval_Platform SHALL compute all analytics aggregations using OpenSearch aggregation queries without requiring data export
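To pin down one of the 11.2 metrics: Cohen's Kappa for two categorical score sources, restricted per 11.5 to items scored by both sources. Requirement 11.6 calls for OpenSearch aggregations; this pure-Python version only fixes the formula:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa between two categorical score sources (Req 11.2).
    Inputs are aligned lists: position i holds both sources' labels for
    the same Trace/Observation (Req 11.5)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    pa, pb = Counter(rater_a), Counter(rater_b)
    # chance agreement: product of each rater's marginal per category
    pe = sum((pa[c] / n) * (pb[c] / n) for c in set(pa) | set(pb))
    if pe == 1.0:
        return 1.0  # degenerate: both raters always pick the same single category
    return (po - pe) / (1 - pe)
```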

Requirement 12: Dashboards and Monitoring

User Story: As a Platform Administrator, I want configurable dashboards so that I can monitor LLM application performance, cost, and quality metrics.

Acceptance Criteria

  • 12.1 WHEN a user creates a custom Dashboard, THE OSD_Plugin SHALL allow adding chart widgets with configurable data sources, aggregation types, filters, and groupings
  • 12.2 THE OSD_Plugin SHALL support chart types: line, bar, time series, and pie
  • 12.3 WHEN a user configures a chart widget, THE OSD_Plugin SHALL allow aggregation across Traces, Observations, Scores, Sessions, and Users with multi-level grouping
  • 12.4 THE OSD_Plugin SHALL provide curated default dashboards for: latency metrics, cost tracking, token usage, and score summaries
  • 12.5 WHEN a Dashboard is loaded, THE OSD_Plugin SHALL execute all widget queries in parallel and render results within 5 seconds for indices containing up to 10 million documents
  • 12.6 WHEN a user applies a global time range filter to a Dashboard, THE OSD_Plugin SHALL propagate the filter to all chart widgets

Requirement 13: Python Instrumentation Library

User Story: As an Application Developer, I want a Python instrumentation library so that I can trace LLM calls, create eval sets, run experiments, and submit scores from my Python application.

Acceptance Criteria

  • 13.1 THE Instrumentation_Library SHALL provide decorators and context managers for creating Traces and Observations around Python function calls
  • 13.2 WHEN a decorated function executes, THE Instrumentation_Library SHALL capture input arguments, return values, start time, end time, and any raised exceptions as Observation fields
  • 13.3 THE Instrumentation_Library SHALL emit all telemetry as OTLP-compatible spans and span attributes so that OTel_Collector can ingest them
  • 13.4 THE Instrumentation_Library SHALL provide a client for CRUD operations on Eval_Sets and Experiments via the Agentic_AI_Eval_Platform API
  • 13.5 THE Instrumentation_Library SHALL provide an experiment runner that accepts an Eval_Set id, an application function, and optional evaluator functions, and executes the experiment as specified in Requirement 6
  • 13.6 THE Instrumentation_Library SHALL provide a method for submitting Scores with support for all score types (numeric, categorical, boolean), idempotency keys, and optional configId
  • 13.7 THE Instrumentation_Library SHALL provide integration modules for popular frameworks: OpenAI, LangChain, LangGraph, and Pydantic AI that automatically instrument LLM calls without manual decoration
  • 13.8 WHEN the Instrumentation_Library serializes telemetry for OTLP export, THE Instrumentation_Library SHALL produce valid OTLP payloads that round-trip through OTel_Collector without data loss
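A minimal sketch of the decorator behavior described in 13.1 and 13.2. A real library would emit the captured fields as an OTLP span (13.3); here they are appended to a list purely for illustration:

```python
import functools
import time

OBSERVATIONS = []  # stand-in for the OTLP export pipeline

def observe(fn):
    """Capture inputs, output, timing, and exceptions around a function
    call as Observation-shaped fields (Req 13.1/13.2)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        obs = {"name": fn.__name__,
               "input": {"args": args, "kwargs": kwargs},
               "startTime": time.time()}
        try:
            result = fn(*args, **kwargs)
            obs["output"] = result
            return result
        except Exception as exc:
            obs["level"] = "ERROR"
            obs["output"] = repr(exc)
            raise  # the exception still propagates to the caller
        finally:
            obs["endTime"] = time.time()
            OBSERVATIONS.append(obs)
    return wrapper
```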

Requirement 14: TypeScript Instrumentation Library

User Story: As an Application Developer, I want a TypeScript instrumentation library so that I can trace LLM calls, create eval sets, run experiments, and submit scores from my TypeScript application.

Acceptance Criteria

  • 14.1 THE Instrumentation_Library SHALL provide wrapper functions and async context tracking for creating Traces and Observations around TypeScript function calls
  • 14.2 WHEN a wrapped function executes, THE Instrumentation_Library SHALL capture input arguments, return values, start time, end time, and any thrown errors as Observation fields
  • 14.3 THE Instrumentation_Library SHALL emit all telemetry as OTLP-compatible spans and span attributes so that OTel_Collector can ingest them
  • 14.4 THE Instrumentation_Library SHALL provide a client for CRUD operations on Eval_Sets and Experiments via the Agentic_AI_Eval_Platform API
  • 14.5 THE Instrumentation_Library SHALL provide an experiment runner that accepts an Eval_Set id, an application function, and optional evaluator functions, and executes the experiment as specified in Requirement 6
  • 14.6 THE Instrumentation_Library SHALL provide a method for submitting Scores with support for all score types (numeric, categorical, boolean), idempotency keys, and optional configId
  • 14.7 THE Instrumentation_Library SHALL provide integration modules for popular frameworks: OpenAI SDK, LangChain.js, and Vercel AI SDK that automatically instrument LLM calls without manual wrapping
  • 14.8 WHEN the Instrumentation_Library serializes telemetry for OTLP export, THE Instrumentation_Library SHALL produce valid OTLP payloads that round-trip through OTel_Collector without data loss

Requirement 15: Trace and Observation Browsing

User Story: As an Application Developer, I want to browse and search traces and observations in the UI so that I can debug and understand my LLM application behavior.

Acceptance Criteria

  • 15.1 WHEN a user opens the trace list view, THE OSD_Plugin SHALL display a paginated, sortable table of Traces with columns: name, timestamp, latency, cost, token usage, tags, and score summary
  • 15.2 WHEN a user applies filters (by name, tag, environment, user, session, time range, or score), THE OSD_Plugin SHALL query the OpenSearch Index and return matching Traces
  • 15.3 WHEN a user opens a Trace detail view, THE OSD_Plugin SHALL display the full Observation tree with parent-child hierarchy, timing waterfall, and input/output for each Observation
  • 15.4 WHEN a user selects an Observation in the detail view, THE OSD_Plugin SHALL display the Observation fields: type, model, input, output, usage, cost, latency, and any attached Scores
  • 15.5 WHEN a user searches Traces by input or output content, THE OSD_Plugin SHALL perform full-text search across Trace and Observation input/output fields

Requirement 16: Session Management

User Story: As an Application Developer, I want to group related traces into sessions so that I can analyze multi-turn conversations and user journeys.

Acceptance Criteria

  • 16.1 WHEN a Trace is ingested with a sessionId, THE Agentic_AI_Eval_Platform SHALL associate the Trace with the corresponding Session
  • 16.2 WHEN a user opens a Session detail view, THE OSD_Plugin SHALL display all Traces belonging to that Session in chronological order
  • 16.3 WHEN Scores are attached to a sessionId, THE Agentic_AI_Eval_Platform SHALL store and query them as session-level Scores
  • 16.4 WHEN a user lists Sessions, THE OSD_Plugin SHALL display aggregate metrics: trace count, total latency, total cost, and score summaries

Requirement 17: OpenSearch Dashboards Plugin Architecture

User Story: As a Platform Administrator, I want the evaluation UI built as proper OpenSearch Dashboards plugins so that the platform integrates natively with the OpenSearch ecosystem.

Acceptance Criteria

  • 17.1 THE OSD_Plugin SHALL follow the OpenSearch Dashboards plugin architecture with proper plugin registration, navigation entries, and saved object types
  • 17.2 THE OSD_Plugin SHALL use the OpenSearch Dashboards HTTP service for all backend API calls to OpenSearch indices
  • 17.3 THE OSD_Plugin SHALL implement project-based access control so that users only see data belonging to their authorized projects
  • 17.4 WHEN the OSD_Plugin is installed, THE OSD_Plugin SHALL register navigation entries for: Traces, Sessions, Eval Sets, Experiments, Annotation Queues, Scores, Evaluators, and Dashboards
  • 17.5 THE OSD_Plugin SHALL use React components consistent with the OpenSearch Dashboards OUI (OpenSearch UI) component library

Requirement 18: Async Processing via OpenSearch Job Scheduler

User Story: As a Platform Administrator, I want asynchronous tasks like LLM-as-a-Judge evaluations to execute reliably via the OpenSearch Job Scheduler so that both online and offline evaluation processing integrates natively with the OpenSearch ecosystem without a custom worker service.

Acceptance Criteria

  • 18.1 THE Job_Scheduler SHALL register scheduled jobs for pending evaluation tasks and execute them asynchronously via the OpenSearch Job Scheduler plugin
  • 18.2 WHEN an LLM_Judge evaluation job is queued, THE Job_Scheduler SHALL pick up the job, execute the LLM call, and submit the resulting Score within the configured timeout
  • 18.3 IF a Job_Scheduler job fails, THEN THE Job_Scheduler SHALL retry the job up to a configurable maximum retry count with exponential backoff
  • 18.4 WHEN a Job_Scheduler job exceeds the maximum retry count, THE Job_Scheduler SHALL mark the job as failed and record the error for operator review
  • 18.5 THE Job_Scheduler SHALL leverage the OpenSearch Job Scheduler's built-in distributed execution to support horizontal scaling so that multiple cluster nodes can process jobs concurrently without duplicate execution
  • 18.6 THE Job_Scheduler SHALL record job execution metrics (queue depth, processing time, success rate, failure rate) in an OpenSearch Index for monitoring
  • 18.7 THE Job_Scheduler SHALL support separate job queues or priority levels for Online_Evaluation jobs (low-latency, single-item) and Offline_Evaluation jobs (batch, high-throughput) so that online evaluations are not starved by large offline batch runs
  • 18.8 WHEN an Experiment_Run triggers batch scoring for multiple Experiment_Run_Items, THE Job_Scheduler SHALL create individual evaluation jobs for each item and process them concurrently up to a configurable concurrency limit
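The retry policy in 18.3 leaves the backoff shape configurable; one common choice is capped exponential backoff with full jitter, sketched below. The parameter defaults are assumptions, not mandated values:

```python
import random

def backoff_delays(max_retries: int, base: float = 1.0,
                   cap: float = 60.0, seed=None):
    """Capped exponential backoff with full jitter for failed evaluation
    jobs (Req 18.3): retry i waits uniformly in [0, min(cap, base * 2**i))."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * (2 ** attempt)))
            for attempt in range(max_retries)]
```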

Requirement 19: Single-Trace Visualizations (Trace Map and Debug Timeline)

User Story: As an Application Developer debugging a complex agentic flow, I want graph and timeline visualizations of a single trace so that I can understand execution structure, parallel branches, and performance bottlenecks.

Acceptance Criteria

  • 19.1 WHEN a user opens a Trace detail view, THE OSD_Plugin SHALL provide a toggle to switch between the observation tree view (Req 15.3) and a Trace Map graph view
  • 19.2 WHEN the Trace Map view is active, THE OSD_Plugin SHALL render a directed graph where nodes represent Observations and edges represent parent-child relationships derived from parentObservationId
  • 19.3 THE Trace Map SHALL visually distinguish node types (span, generation, tool call, retrieval) using distinct icons or colors per observation type
  • 19.4 WHEN a user hovers over a Trace Map node, THE OSD_Plugin SHALL display a tooltip with the Observation name, type, model, latency, cost, and score summary
  • 19.5 WHEN a user clicks a Trace Map node, THE OSD_Plugin SHALL display the full Observation detail panel with input, output, usage, and attached Scores
  • 19.6 THE Trace Map SHALL highlight the critical path (longest sequential chain of observations by cumulative latency) with a distinct visual indicator
  • 19.7 WHEN a Trace contains parallel branches (multiple Observations sharing the same parentObservationId), THE Trace Map SHALL render them as parallel paths in the graph layout
  • 19.8 WHEN a user opens the Debug Timeline view, THE OSD_Plugin SHALL render a waterfall/swim-lane visualization showing each Observation as a horizontal bar positioned by startTime and endTime, with concurrent observations on parallel lanes
  • 19.9 THE Debug Timeline SHALL overlay token usage and cost metrics on each observation bar so that cost hotspots are visually identifiable
  • 19.10 WHEN a user zooms into a time range on the Debug Timeline, THE OSD_Plugin SHALL re-render the visible observations at higher detail within the selected range
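The critical path in 19.6 can be derived from the `parentObservationId` tree; the sketch below uses one simple interpretation (sum each observation's own latency along a root-to-leaf chain), which a real implementation might refine for overlapping parent/child spans:

```python
def critical_path(observations):
    """Req 19.6 sketch: the root-to-leaf chain with the greatest cumulative
    latency, where latency is endTime - startTime and edges follow
    parentObservationId. Returns (total_latency, list_of_observation_ids)."""
    children = {}
    for o in observations:
        children.setdefault(o.get("parentObservationId"), []).append(o)

    def best(obs):
        lat = obs["endTime"] - obs["startTime"]
        kids = children.get(obs["id"], [])
        if not kids:
            return lat, [obs["id"]]
        sub_lat, sub_path = max((best(k) for k in kids), key=lambda t: t[0])
        return lat + sub_lat, [obs["id"]] + sub_path

    roots = children.get(None, [])
    return max((best(r) for r in roots), key=lambda t: t[0])
```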

Requirement 20: Multi-Trace Analytics Visualizations (Agent Map and Agent Path)

User Story: As an Evaluation Engineer analyzing agentic application behavior at scale, I want aggregate topology and path visualizations across multiple traces so that I can understand system architecture, common execution patterns, and failure paths.

Acceptance Criteria

  • 20.1 WHEN a user opens the Agent Map view at project level, THE OSD_Plugin SHALL render a directed graph where nodes represent distinct observation types or names (e.g., "Planner Agent", "Search Tool") and edges represent call relationships, aggregated across all traces matching the current filters
  • 20.2 WHEN a user opens the Agent Map view at session level, THE OSD_Plugin SHALL render the same topology graph scoped to traces within the selected Session
  • 20.3 THE Agent Map edges SHALL display aggregate metrics: call count, average latency, error rate, and average cost per edge
  • 20.4 THE Agent Map nodes SHALL display aggregate metrics: total invocations, average latency, total cost, and average score (if scores are attached)
  • 20.5 WHEN a user clicks an Agent Map edge, THE OSD_Plugin SHALL display a list of individual traces that traversed that edge, linked to their Trace detail views
  • 20.6 WHEN a user applies filters (time range, environment, tags, score range) to the Agent Map, THE OSD_Plugin SHALL recompute the topology using only matching traces
  • 20.7 WHEN a user opens the Agent Path (Sankey) view, THE OSD_Plugin SHALL render a Sankey diagram where each vertical band represents a step in the execution sequence (ordered by observation position in the trace), and flow width represents the number of traces that followed that path
  • 20.8 THE Agent Path SHALL derive execution paths by extracting the ordered sequence of observation types per trace and aggregating identical paths into flow bands
  • 20.9 WHEN a user filters the Agent Path by score range, THE OSD_Plugin SHALL highlight or isolate paths taken by traces within the selected score range (e.g., "show paths for low-scoring traces only")
  • 20.10 WHEN a user clicks a flow segment in the Agent Path, THE OSD_Plugin SHALL display the list of traces that followed that specific path
  • 20.11 THE Agent Path SHALL support both live production traces (filtered by time range) and Experiment_Run traces (filtered by Experiment_Run id)
  • 20.12 THE OSD_Plugin SHALL compute all Agent Map and Agent Path aggregations using OpenSearch aggregation queries without requiring data export
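The path-derivation rule in 20.8 (ordered sequence of observation types per trace, identical sequences aggregated into flow bands) can be sketched client-side, though 20.12 requires the production computation to run as OpenSearch aggregations:

```python
from collections import Counter

def aggregate_paths(traces):
    """Req 20.8 sketch: extract each trace's ordered sequence of observation
    types and count identical sequences; the counts become Sankey flow widths."""
    paths = Counter()
    for trace in traces:
        ordered = sorted(trace["observations"], key=lambda o: o["startTime"])
        paths[tuple(o["type"] for o in ordered)] += 1
    return paths
```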

Requirement 21: Built-in Deterministic Evaluators

User Story: As an Evaluation Engineer, I want to configure deterministic evaluators from the UI so that I can score traces and experiment results using programmatic criteria without writing custom code or consuming LLM tokens.

Acceptance Criteria

  • 21.1 THE Agentic_AI_Eval_Platform SHALL provide the following built-in Deterministic_Evaluator types, each configurable from the OSD_Plugin without writing code:
| Evaluator | Score Type | Input | Description |
| --- | --- | --- | --- |
| Exact Match | BOOLEAN | output, expectedOutput | Returns true if output exactly matches expectedOutput (with optional case-insensitive and whitespace-normalized modes) |
| Contains | BOOLEAN | output, search string(s) | Returns true if output contains all specified substrings or matches a regex pattern |
| JSON Validity | BOOLEAN | output | Returns true if output is valid JSON |
| JSON Schema Conformance | BOOLEAN | output, JSON Schema | Returns true if output is valid JSON conforming to the provided schema |
| Regex Match | BOOLEAN | output, regex pattern | Returns true if output matches the provided regex pattern |
| Levenshtein Distance | NUMERIC | output, expectedOutput | Returns the normalized edit distance (0.0 to 1.0) between output and expectedOutput |
| Cosine Similarity | NUMERIC | output, expectedOutput | Returns the cosine similarity (0.0 to 1.0) between embedding vectors of output and expectedOutput, using a configurable embedding model |
| Latency Threshold | BOOLEAN | trace/observation latency, threshold | Returns true if latency is within the configured threshold |
| Cost Threshold | BOOLEAN | trace/observation cost, threshold | Returns true if cost is within the configured threshold |
| Token Count | NUMERIC | trace/observation usage | Returns the total token count from usage metadata |
  • 21.2 WHEN a user creates a Deterministic_Evaluator, THE OSD_Plugin SHALL store it in an OpenSearch Index with fields: id, projectId, name, evaluatorType, configuration (type-specific parameters), evaluationMode (ONLINE or OFFLINE), target type (Trace or Observation), and scoreConfigId
  • 21.3 WHEN a Deterministic_Evaluator with evaluationMode OFFLINE uses expectedOutput (Exact Match, Levenshtein Distance, Cosine Similarity), THE Agentic_AI_Eval_Platform SHALL only allow it to be assigned to Experiment_Run scoring and SHALL reject assignment to online trace-matching triggers
  • 21.4 WHEN a Deterministic_Evaluator is triggered, THE Job_Scheduler SHALL execute the evaluator logic, compute the score, and submit the resulting Score document with source EVAL
  • 21.5 WHEN the Cosine Similarity evaluator is used, THE Agentic_AI_Eval_Platform SHALL call a configurable embedding model (Amazon Bedrock or OpenAI embeddings) to generate vectors before computing similarity
  • 21.6 THE Agentic_AI_Eval_Platform SHALL allow Deterministic_Evaluators to be assigned to the same online trace-matching triggers and offline Experiment_Run scoring as LLM_Judge evaluators
  • 21.7 WHEN a user configures an Experiment_Run or online trigger, THE OSD_Plugin SHALL allow selecting a mix of LLM_Judge Evaluator_Templates and Deterministic_Evaluators to run together
  • 21.8 THE Deterministic_Evaluator execution SHALL NOT create an execution Trace (unlike LLM_Judge), since deterministic evaluations are lightweight and do not require debugging of evaluator reasoning
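Two of the evaluators in 21.1 can be sketched to make their scoring semantics concrete. The helper names are hypothetical; the sketch assumes normalized edit distance means raw Levenshtein distance divided by the longer string's length, yielding a score in [0.0, 1.0]:

```python
def exact_match(output: str, expected: str, case_insensitive: bool = False,
                normalize_whitespace: bool = False) -> bool:
    """Exact Match evaluator sketch (21.1): supports the optional
    case-insensitive and whitespace-normalized modes."""
    if normalize_whitespace:
        output, expected = " ".join(output.split()), " ".join(expected.split())
    if case_insensitive:
        output, expected = output.lower(), expected.lower()
    return output == expected

def normalized_levenshtein(output: str, expected: str) -> float:
    """Levenshtein Distance evaluator sketch (21.1): edit distance
    normalized by the longer string's length, so 0.0 means identical
    and 1.0 means maximally different."""
    if not output and not expected:
        return 0.0
    # Standard two-row dynamic-programming edit distance.
    prev = list(range(len(expected) + 1))
    for i, a in enumerate(output, 1):
        curr = [i]
        for j, b in enumerate(expected, 1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (a != b)))  # substitution
        prev = curr
    return prev[-1] / max(len(output), len(expected))
```

Per 21.4, the Job_Scheduler would wrap such a function's return value in a Score document with source EVAL; per 21.8, no execution Trace is emitted.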

Future Enhancement: Support for user-defined custom evaluator functions (Python/TypeScript) that can be uploaded and executed server-side by the Job_Scheduler. This would enable arbitrary scoring logic beyond the built-in evaluator types. Deferred due to security implications of running user-provided code.

Requirement 22: RAG Evaluation via Ragas Framework

User Story: As an Evaluation Engineer building RAG applications, I want built-in RAG evaluation metrics based on the Ragas framework so that I can assess retrieval quality, answer faithfulness, and context relevance without implementing custom evaluators.

Acceptance Criteria

  • 22.1 THE Agentic_AI_Eval_Platform SHALL extract RAG_Context from Traces by identifying retrieval-type Observations (type = "retrieval" or a configurable observation name filter) and collecting their output fields as the retrieved context documents
  • 22.2 THE Agentic_AI_Eval_Platform SHALL provide the following built-in RAG evaluation metrics, each implemented as a pre-configured Evaluator_Template using LLM-as-a-Judge:
| Metric | Score Type | Inputs | evaluationMode | Description |
| --- | --- | --- | --- | --- |
| Faithfulness | NUMERIC (0.0–1.0) | answer, contexts | ONLINE or OFFLINE | Measures whether claims in the answer are supported by the retrieved contexts |
| Answer Relevancy | NUMERIC (0.0–1.0) | question, answer | ONLINE or OFFLINE | Measures whether the answer addresses the original question |
| Context Precision | NUMERIC (0.0–1.0) | question, contexts, ground_truth | OFFLINE only | Measures whether relevant contexts are ranked higher than irrelevant ones |
| Context Recall | NUMERIC (0.0–1.0) | contexts, ground_truth | OFFLINE only | Measures whether the retrieved contexts cover the information in the ground truth |
| Context Relevancy | NUMERIC (0.0–1.0) | question, contexts | ONLINE or OFFLINE | Measures whether the retrieved contexts are relevant to the question |
| Answer Correctness | NUMERIC (0.0–1.0) | answer, ground_truth | OFFLINE only | Measures factual overlap between the answer and the ground truth |
| Answer Similarity | NUMERIC (0.0–1.0) | answer, ground_truth | OFFLINE only | Measures semantic similarity between the answer and the ground truth using embeddings |
  • 22.3 WHEN a user selects a RAG evaluation metric from the OSD_Plugin, THE OSD_Plugin SHALL display the metric's required inputs and automatically map them from the Trace structure: question from trace input, answer from trace output, contexts from RAG_Context observations, and ground_truth from Experiment expectedOutput (offline only)
  • 22.4 THE Agentic_AI_Eval_Platform SHALL expose additional template variables {{contexts}} and {{question}} for RAG Evaluator_Templates, where {{contexts}} is the concatenated or structured list of RAG_Context documents and {{question}} is the trace input
  • 22.5 WHEN a RAG metric requires ground_truth (Context Precision, Context Recall, Answer Correctness, Answer Similarity), THE Agentic_AI_Eval_Platform SHALL enforce evaluationMode OFFLINE and reject assignment to online trace-matching triggers
  • 22.6 WHEN a RAG metric uses only question, answer, and contexts (Faithfulness, Answer Relevancy, Context Relevancy), THE Agentic_AI_Eval_Platform SHALL allow both ONLINE and OFFLINE evaluationMode
  • 22.7 THE Agentic_AI_Eval_Platform SHALL allow users to customize the underlying LLM prompt for each RAG metric while preserving the metric's scoring logic and output schema
  • 22.8 WHEN the Answer Similarity metric is used, THE Agentic_AI_Eval_Platform SHALL compute similarity using a configurable embedding model (Amazon Bedrock or OpenAI embeddings) rather than an LLM judge call
  • 22.9 WHEN a user runs a RAG evaluation suite, THE OSD_Plugin SHALL allow selecting multiple RAG metrics to execute together and display results in a unified RAG evaluation dashboard showing all metric scores per trace or Experiment_Run_Item
  • 22.10 THE Instrumentation_Library SHALL provide a helper method to tag retrieval observations with type "retrieval" and structure the output as a list of context documents, ensuring RAG_Context extraction works correctly
  • 22.11 WHEN RAG evaluation scores are displayed, THE OSD_Plugin SHALL show the extracted contexts alongside the scores so users can understand why a particular faithfulness or relevancy score was assigned
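The RAG_Context extraction in 22.1 and the input mapping in 22.3 can be sketched as follows. The helper names are hypothetical; field names follow the Trace and Observation schemas in Requirement 1, and the sketch assumes a retrieval observation's output is a list of context documents per 22.10 (tolerating a single document for robustness):

```python
def extract_rag_context(trace, type_filter="retrieval"):
    """Collect retrieved context documents from a trace (22.1 sketch):
    find observations matching the retrieval type filter, in execution
    order, and gather their output fields as the RAG_Context."""
    contexts = []
    for obs in sorted(trace["observations"], key=lambda o: o["startTime"]):
        if obs.get("type") == type_filter:
            out = obs.get("output")
            # Expected to be a list of documents (22.10); accept a scalar too.
            contexts.extend(out if isinstance(out, list) else [out])
    return contexts

def map_rag_inputs(trace, expected_output=None):
    """Map RAG metric inputs from the trace structure (22.3 sketch):
    question <- trace input, answer <- trace output, contexts <-
    RAG_Context, ground_truth <- Experiment expectedOutput (offline only)."""
    return {
        "question": trace.get("input"),
        "answer": trace.get("output"),
        "contexts": extract_rag_context(trace),
        "ground_truth": expected_output,
    }
```

The `contexts` and `question` values produced here are what would back the `{{contexts}}` and `{{question}}` template variables described in 22.4.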
