Requirement 1: Trace and Observation Ingestion via OTel Collector
User Story: As an Application Developer, I want to send trace and observation telemetry from my application so that the Agentic_AI_Eval_Platform captures and stores all execution data for evaluation.
Acceptance Criteria
1.1 WHEN the Instrumentation_Library emits an OTLP trace payload, THE OTel_Collector SHALL receive the payload via gRPC or HTTP OTLP receiver and write it to the appropriate OpenSearch Index
1.2 WHEN a Trace document is ingested, THE OTel_Collector SHALL map OTLP span attributes to the Trace schema fields: id, name, timestamp, environment, tags, release, version, input, output, metadata, sessionId, userId, and projectId
1.3 WHEN an Observation document is ingested, THE OTel_Collector SHALL map OTLP span attributes to the Observation schema fields: id, traceId, type, startTime, endTime, name, model, input, output, usageDetails, costDetails, and parentObservationId
1.4 WHEN a Trace contains nested Observations, THE OTel_Collector SHALL preserve the parent-child hierarchy using parentObservationId references
1.5 IF the OTel_Collector receives a malformed OTLP payload, THEN THE OTel_Collector SHALL reject the payload with a descriptive error and not write partial data to any Index
1.6 WHEN multiple Observations reference the same traceId, THE Agentic_AI_Eval_Platform SHALL store each Observation as a separate document linked to the parent Trace by traceId
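The attribute-to-field mapping in 1.2 can be sketched as a pure function. The `eval.trace.*` attribute key convention below is an illustrative assumption, not a defined part of this specification; the real collector pipeline would fix its own convention.

```python
# Sketch: mapping OTLP span data onto the Trace schema of 1.2.
# The "eval.trace.*" attribute prefix is an assumed naming convention.
ATTR_PREFIX = "eval.trace."
TRACE_FIELDS = {"environment", "tags", "release", "version", "input",
                "output", "metadata", "sessionId", "userId", "projectId"}

def map_span_to_trace(span_id, name, start_unix_nano, attributes):
    """Build a Trace document (1.2) from OTLP span data."""
    doc = {
        "id": span_id,
        "name": name,
        # OTLP timestamps are unix nanoseconds; the index mapping uses `date`
        "timestamp": start_unix_nano // 1_000_000,  # epoch millis
    }
    for key, value in attributes.items():
        if key.startswith(ATTR_PREFIX):
            field = key[len(ATTR_PREFIX):]
            if field in TRACE_FIELDS:
                doc[field] = value
    return doc
```

Unknown attributes are dropped rather than written through, which also supports the all-or-nothing rejection behavior of 1.5 at the document level.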
Requirement 2: OpenSearch Index Design for Traces and Observations
User Story: As a Platform Administrator, I want traces and observations stored in well-structured OpenSearch indices so that queries, aggregations, and analytics perform efficiently.
Acceptance Criteria
2.1 THE Agentic_AI_Eval_Platform SHALL store Traces in a dedicated OpenSearch Index with mappings for: id (keyword), name (text+keyword), timestamp (date), environment (keyword), tags (keyword array), release (keyword), version (keyword), input (object), output (object), metadata (object), sessionId (keyword), userId (keyword), projectId (keyword), bookmarked (boolean), public (boolean), createdAt (date), updatedAt (date)
2.2 THE Agentic_AI_Eval_Platform SHALL store Observations in a dedicated OpenSearch Index with mappings for: id (keyword), traceId (keyword), projectId (keyword), type (keyword), startTime (date), endTime (date), name (text+keyword), model (keyword), level (keyword), parentObservationId (keyword), input (object), output (object), usageDetails (object), costDetails (object), latency (float), and all other Observation schema fields
2.3 WHEN a query filters Traces by projectId and timestamp range, THE Agentic_AI_Eval_Platform SHALL return results within 2 seconds for indices containing up to 10 million documents
2.4 THE Agentic_AI_Eval_Platform SHALL configure Index templates with appropriate shard counts, replica settings, and refresh intervals for the expected data volume
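An index template satisfying 2.1 and the tuning requirements of 2.4 might look as follows. The shard count, replica count, and refresh interval are illustrative defaults, not recommendations for any particular data volume.

```python
# Sketch of an OpenSearch index template body for the Traces index (2.1, 2.4).
# Settings values are placeholder assumptions to be tuned per deployment.
TRACES_INDEX_TEMPLATE = {
    "index_patterns": ["traces-*"],
    "template": {
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1,
            "refresh_interval": "5s",
        },
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "name": {"type": "text",
                         "fields": {"keyword": {"type": "keyword"}}},
                "timestamp": {"type": "date"},
                "environment": {"type": "keyword"},
                "tags": {"type": "keyword"},
                "release": {"type": "keyword"},
                "version": {"type": "keyword"},
                "input": {"type": "object"},
                "output": {"type": "object"},
                "metadata": {"type": "object"},
                "sessionId": {"type": "keyword"},
                "userId": {"type": "keyword"},
                "projectId": {"type": "keyword"},
                "bookmarked": {"type": "boolean"},
                "public": {"type": "boolean"},
                "createdAt": {"type": "date"},
                "updatedAt": {"type": "date"},
            }
        },
    },
}
```

The body would be registered via the `_index_template` API (e.g. `PUT _index_template/traces`), so that every `traces-*` index inherits these mappings.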
Requirement 3: Score Storage and Score Config Management
User Story: As an Evaluation Engineer, I want to create score configurations and attach scores to traces, observations, sessions, or experiment runs so that evaluation results are structured and queryable.
Acceptance Criteria
3.1 WHEN a user creates a Score_Config, THE OSD_Plugin SHALL store it in an OpenSearch Index with fields: id, projectId, name, dataType (NUMERIC, CATEGORICAL, or BOOLEAN), isArchived, minValue, maxValue, categories, and description
3.2 WHEN a Score is submitted via the Instrumentation_Library or OSD_Plugin, THE Agentic_AI_Eval_Platform SHALL validate the Score value against its associated Score_Config (data type, min/max range, allowed categories)
3.3 IF a Score value violates its Score_Config constraints, THEN THE Agentic_AI_Eval_Platform SHALL reject the Score with a descriptive validation error
3.4 THE Agentic_AI_Eval_Platform SHALL store Scores in an OpenSearch Index with fields: id, projectId, name, value, dataType, source (EVAL_ONLINE, EVAL_OFFLINE, SDK, ANNOTATION, or API), traceId, observationId, sessionId, experimentRunId, authorUserId, comment, metadata, configId, timestamp, and environment
3.5 WHEN a Score is submitted with a configId, THE Agentic_AI_Eval_Platform SHALL verify the referenced Score_Config exists and is not archived before accepting the Score
3.6 WHEN a Score is submitted with the same idempotency key as an existing Score, THE Agentic_AI_Eval_Platform SHALL update the existing Score rather than creating a duplicate
Requirement 4: Eval Set Management
User Story: As an Evaluation Engineer, I want to create and manage eval sets of input/expected-output pairs so that I can run offline experiments against my LLM application.
Acceptance Criteria
4.1 WHEN a user creates an Eval_Set, THE OSD_Plugin SHALL store it in an OpenSearch Index with fields: id, projectId, name, description, metadata, inputSchema (JSON Schema), expectedOutputSchema (JSON Schema), createdAt, and updatedAt
4.2 WHEN a user adds an Experiment to an Eval_Set, THE Agentic_AI_Eval_Platform SHALL store it with fields: id, projectId, evalSetId, input, expectedOutput, metadata, sourceTraceId, sourceObservationId, status (ACTIVE or ARCHIVED), validFrom, and createdAt
4.3 WHEN an Eval_Set defines an inputSchema, THE Agentic_AI_Eval_Platform SHALL validate each Experiment input against the JSON Schema before accepting it
4.4 WHEN an Eval_Set defines an expectedOutputSchema, THE Agentic_AI_Eval_Platform SHALL validate each Experiment expectedOutput against the JSON Schema before accepting it
4.5 IF an Experiment input or expectedOutput fails JSON Schema validation, THEN THE Agentic_AI_Eval_Platform SHALL reject the item with a descriptive validation error identifying the schema violation
4.6 WHEN a user updates an Experiment, THE Agentic_AI_Eval_Platform SHALL create a new version by setting validFrom on the new version and preserving the previous version for historical queries
4.7 WHEN a user queries Experiments for an Eval_Set, THE Agentic_AI_Eval_Platform SHALL return only the latest active version of each item by default
4.8 WHEN a user archives an Experiment, THE Agentic_AI_Eval_Platform SHALL set the status to ARCHIVED and exclude the item from default queries while retaining it for historical access
4.9 THE Agentic_AI_Eval_Platform SHALL enforce unique Eval_Set names within a projectId
Requirement 5: Experiment Runs
User Story: As an Evaluation Engineer, I want to run my LLM application against an eval set and record the results so that I can evaluate application performance across test cases.
Acceptance Criteria
5.1 WHEN a user initiates an Experiment_Run via the Instrumentation_Library, THE Agentic_AI_Eval_Platform SHALL create an Experiment_Run document with fields: id, projectId, evalSetId, name, description, metadata, and createdAt
5.2 WHEN the application processes an Experiment during an Experiment_Run, THE Agentic_AI_Eval_Platform SHALL create an Experiment_Run_Item linking the Experiment to the resulting Trace via fields: id, projectId, experimentRunId, experimentId, traceId, observationId, and error
5.3 WHEN an Experiment_Run completes, THE OSD_Plugin SHALL display a summary showing total items processed, items with errors, and aggregate scores
5.4 WHEN a user views an Experiment_Run, THE OSD_Plugin SHALL display side-by-side comparison of Experiment expected output and actual Trace output for each Experiment_Run_Item
5.5 THE Agentic_AI_Eval_Platform SHALL enforce unique Experiment_Run names within an evalSetId and projectId combination
5.6 WHEN an Experiment_Run_Item encounters an error during processing, THE Agentic_AI_Eval_Platform SHALL record the error message in the Experiment_Run_Item error field and continue processing remaining items
Requirement 6: Experiment Execution via SDK
User Story: As an Application Developer, I want to programmatically run experiments from my Python or TypeScript code so that I can integrate evaluation into my CI/CD pipeline.
Acceptance Criteria
6.1 WHEN a developer calls the experiment runner function in the Instrumentation_Library, THE Instrumentation_Library SHALL fetch the Experiments, execute the provided application function for each item concurrently, and record each result as an Experiment_Run_Item
6.2 WHEN the experiment runner executes an application function for an Experiment, THE Instrumentation_Library SHALL automatically create a Trace for the execution and link it to the Experiment_Run_Item
6.3 WHEN an item-level evaluator is provided, THE Instrumentation_Library SHALL execute the evaluator for each Experiment_Run_Item and submit the resulting Scores
6.4 WHEN a run-level evaluator is provided, THE Instrumentation_Library SHALL execute the evaluator after all items are processed and submit the resulting Score attached to the Experiment_Run
6.5 IF an application function throws an error for a single Experiment, THEN THE Instrumentation_Library SHALL record the error on that Experiment_Run_Item and continue processing remaining items without aborting the run
6.6 WHEN the experiment completes, THE Instrumentation_Library SHALL return a summary object containing the Experiment_Run id, total items, successful items, failed items, and aggregate scores
Requirement 7: Experiment Execution via UI
User Story: As an Evaluation Engineer, I want to run experiments from the OpenSearch Dashboards UI so that I can evaluate prompt versions without writing code.
Acceptance Criteria
7.1 WHEN a user selects an Eval_Set and an LLM configuration in the OSD_Plugin, THE OSD_Plugin SHALL execute the LLM against each Experiment and record results as an Experiment_Run
7.2 WHEN the UI experiment completes, THE OSD_Plugin SHALL display results in a table with columns for: Experiment input, expected output, actual output, and any attached Scores
7.3 WHEN multiple Experiment_Runs exist for the same Eval_Set, THE OSD_Plugin SHALL provide a side-by-side comparison view showing outputs and scores across runs
7.4 WHEN a user configures an LLM_Judge for a UI experiment, THE OSD_Plugin SHALL automatically score each Experiment_Run_Item using the configured Evaluator_Template
Requirement 8: LLM-as-a-Judge Evaluators
User Story: As an Evaluation Engineer, I want to configure automated LLM-based evaluators so that traces and observations are scored without manual intervention.
Acceptance Criteria
8.1 WHEN a user creates an Evaluator_Template, THE OSD_Plugin SHALL store it with fields: id, projectId, name, prompt template, model configuration (provider, model name, temperature, max tokens), output schema (score name, data type, value mapping), target type (Trace or Observation), and evaluationMode (ONLINE or OFFLINE)
8.2 THE Agentic_AI_Eval_Platform SHALL support LLM providers for LLM_Judge evaluation: Amazon Bedrock, OpenAI, and Anthropic, with model configuration specifying the provider, model identifier, and inference parameters
8.3 WHEN an LLM_Judge evaluation is triggered for a Trace or Observation, THE Job_Scheduler SHALL construct the LLM prompt from the Evaluator_Template, call the configured LLM, parse the response using structured output (JSON mode or function calling where supported), and submit the resulting Score
8.4 WHEN an LLM_Judge evaluation completes, THE Job_Scheduler SHALL create a separate execution Trace capturing the LLM call details (prompt, response, latency, cost) for debugging the evaluator
8.5 IF the LLM response does not conform to the expected output schema, THEN THE Job_Scheduler SHALL record the evaluation as failed with a descriptive error and not submit a Score
8.6 WHEN a user configures an LLM_Judge to run on new Traces matching a filter, THE Job_Scheduler SHALL process matching Traces asynchronously within 60 seconds of ingestion
8.7 WHEN a user views an LLM_Judge Score, THE OSD_Plugin SHALL provide a link to the execution Trace so the user can debug the evaluator reasoning
8.8 WHEN an Evaluator_Template has evaluationMode ONLINE, THE Agentic_AI_Eval_Platform SHALL only expose template variables {{input}}, {{output}}, and {{context}} in the prompt template and SHALL NOT allow {{expectedOutput}}
8.9 WHEN an Evaluator_Template has evaluationMode OFFLINE, THE Agentic_AI_Eval_Platform SHALL expose template variables {{input}}, {{output}}, {{expectedOutput}}, and {{context}} in the prompt template
8.10 IF a user attempts to assign an OFFLINE Evaluator_Template to an online trace-matching trigger, THEN THE Agentic_AI_Eval_Platform SHALL reject the configuration with a descriptive error explaining that reference-based evaluators require Ground_Truth from an Eval_Set
8.11 THE Agentic_AI_Eval_Platform SHALL support multi-criteria evaluation where a single Evaluator_Template produces multiple Scores from one LLM call, each mapped to a separate score name and data type in the output schema
Requirement 9: Annotation Queues for Human Evaluation
User Story: As an Evaluation Engineer, I want to create annotation queues and assign reviewers so that domain experts can manually evaluate traces and observations.
Acceptance Criteria
9.1 WHEN a user creates an Annotation_Queue, THE OSD_Plugin SHALL store it with fields: id, projectId, name, description, scoreConfigIds (list of Score_Configs reviewers must fill), createdAt, and updatedAt
9.2 WHEN a user assigns reviewers to an Annotation_Queue, THE Agentic_AI_Eval_Platform SHALL create assignment records linking each user to the queue
9.3 WHEN a user bulk-adds Traces, Observations, or Sessions to an Annotation_Queue, THE Agentic_AI_Eval_Platform SHALL create an Annotation_Task for each item with status PENDING
9.4 WHEN a reviewer opens an Annotation_Queue, THE OSD_Plugin SHALL present the next PENDING Annotation_Task and lock it to prevent concurrent review by another user
9.5 WHEN a reviewer submits scores for an Annotation_Task, THE Agentic_AI_Eval_Platform SHALL create Score documents with source ANNOTATION and authorUserId, and set the Annotation_Task status to COMPLETED
9.6 WHEN a reviewer submits scores, THE Agentic_AI_Eval_Platform SHALL validate each Score against the Score_Configs specified in the Annotation_Queue
9.7 IF a reviewer abandons a locked Annotation_Task without submitting, THEN THE Agentic_AI_Eval_Platform SHALL release the lock after a configurable timeout and return the task to PENDING status
Requirement 10: Custom Scores via SDK and API
User Story: As an Application Developer, I want to submit custom scores programmatically so that I can integrate user feedback, guardrail results, and custom evaluation pipelines.
Acceptance Criteria
10.1 WHEN the Instrumentation_Library submits a Score, THE Agentic_AI_Eval_Platform SHALL accept it with fields: name, value, dataType, traceId or observationId or sessionId or experimentRunId, comment, metadata, and configId
10.2 WHEN a Score is submitted without a configId, THE Agentic_AI_Eval_Platform SHALL accept the Score using the provided dataType and value without config validation
10.3 WHEN a Score is submitted with a configId, THE Agentic_AI_Eval_Platform SHALL validate the value against the Score_Config constraints before accepting
10.4 THE Instrumentation_Library SHALL support submitting Scores with an idempotency key so that retried submissions update rather than duplicate
10.5 WHEN multiple Scores with different names are submitted for the same Trace, THE Agentic_AI_Eval_Platform SHALL store each Score independently and make all queryable
Requirement 11: Score Analytics and Comparison
User Story: As an Evaluation Engineer, I want to analyze and compare scores across different evaluators, models, and time periods so that I can understand evaluation quality and trends.
Acceptance Criteria
11.1 WHEN a user opens the score analytics Dashboard, THE OSD_Plugin SHALL display aggregate score distributions grouped by score name and source
11.2 WHEN a user selects two or more score sources for comparison, THE OSD_Plugin SHALL compute and display inter-rater agreement metrics: Pearson correlation and Spearman correlation for numeric scores, Cohen's Kappa for categorical scores, and F1 score for boolean scores
11.3 WHEN a user selects two categorical score sources, THE OSD_Plugin SHALL display a confusion matrix and heatmap visualization
11.4 WHEN a user selects a time range, THE OSD_Plugin SHALL display score trends over time as a line chart
11.5 WHEN computing agreement metrics, THE Agentic_AI_Eval_Platform SHALL only include Traces or Observations that have Scores from all selected sources
11.6 THE Agentic_AI_Eval_Platform SHALL compute all analytics aggregations using OpenSearch aggregation queries without requiring data export
Requirement 12: Dashboards and Monitoring
User Story: As a Platform Administrator, I want configurable dashboards so that I can monitor LLM application performance, cost, and quality metrics.
Acceptance Criteria
12.1 WHEN a user creates a custom Dashboard, THE OSD_Plugin SHALL allow adding chart widgets with configurable data sources, aggregation types, filters, and groupings
12.2 THE OSD_Plugin SHALL support chart types: line, bar, time series, and pie
12.3 WHEN a user configures a chart widget, THE OSD_Plugin SHALL allow aggregation across Traces, Observations, Scores, Sessions, and Users with multi-level grouping
12.4 THE OSD_Plugin SHALL provide curated default dashboards for: latency metrics, cost tracking, token usage, and score summaries
12.5 WHEN a Dashboard is loaded, THE OSD_Plugin SHALL execute all widget queries in parallel and render results within 5 seconds for indices containing up to 10 million documents
12.6 WHEN a user applies a global time range filter to a Dashboard, THE OSD_Plugin SHALL propagate the filter to all chart widgets
Requirement 13: Python Instrumentation Library
User Story: As an Application Developer, I want a Python instrumentation library so that I can trace LLM calls, create eval sets, run experiments, and submit scores from my Python application.
Acceptance Criteria
13.1 THE Instrumentation_Library SHALL provide decorators and context managers for creating Traces and Observations around Python function calls
13.2 WHEN a decorated function executes, THE Instrumentation_Library SHALL capture input arguments, return values, start time, end time, and any raised exceptions as Observation fields
13.3 THE Instrumentation_Library SHALL emit all telemetry as OTLP-compatible spans and span attributes so that OTel_Collector can ingest them
13.4 THE Instrumentation_Library SHALL provide a client for CRUD operations on Eval_Sets and Experiments via the Agentic_AI_Eval_Platform API
13.5 THE Instrumentation_Library SHALL provide an experiment runner that accepts an Eval_Set id, an application function, and optional evaluator functions, and executes the experiment as specified in Requirement 6
13.6 THE Instrumentation_Library SHALL provide a method for submitting Scores with support for all score types (numeric, categorical, boolean), idempotency keys, and optional configId
13.7 THE Instrumentation_Library SHALL provide integration modules for popular frameworks (OpenAI, LangChain, LangGraph, and Pydantic AI) that automatically instrument LLM calls without manual decoration
13.8 WHEN the Instrumentation_Library serializes telemetry for OTLP export, THE Instrumentation_Library SHALL produce valid OTLP payloads that round-trip through OTel_Collector without data loss
Requirement 14: TypeScript Instrumentation Library
User Story: As an Application Developer, I want a TypeScript instrumentation library so that I can trace LLM calls, create eval sets, run experiments, and submit scores from my TypeScript application.
Acceptance Criteria
14.1 THE Instrumentation_Library SHALL provide wrapper functions and async context tracking for creating Traces and Observations around TypeScript function calls
14.2 WHEN a wrapped function executes, THE Instrumentation_Library SHALL capture input arguments, return values, start time, end time, and any thrown errors as Observation fields
14.3 THE Instrumentation_Library SHALL emit all telemetry as OTLP-compatible spans and span attributes so that OTel_Collector can ingest them
14.4 THE Instrumentation_Library SHALL provide a client for CRUD operations on Eval_Sets and Experiments via the Agentic_AI_Eval_Platform API
14.5 THE Instrumentation_Library SHALL provide an experiment runner that accepts an Eval_Set id, an application function, and optional evaluator functions, and executes the experiment as specified in Requirement 6
14.6 THE Instrumentation_Library SHALL provide a method for submitting Scores with support for all score types (numeric, categorical, boolean), idempotency keys, and optional configId
14.7 THE Instrumentation_Library SHALL provide integration modules for popular frameworks (OpenAI SDK, LangChain.js, and Vercel AI SDK) that automatically instrument LLM calls without manual wrapping
14.8 WHEN the Instrumentation_Library serializes telemetry for OTLP export, THE Instrumentation_Library SHALL produce valid OTLP payloads that round-trip through OTel_Collector without data loss
Requirement 15: Trace and Observation Browsing
User Story: As an Application Developer, I want to browse and search traces and observations in the UI so that I can debug and understand my LLM application behavior.
Acceptance Criteria
15.1 WHEN a user opens the trace list view, THE OSD_Plugin SHALL display a paginated, sortable table of Traces with columns: name, timestamp, latency, cost, token usage, tags, and score summary
15.2 WHEN a user applies filters (by name, tag, environment, user, session, time range, or score), THE OSD_Plugin SHALL query the OpenSearch Index and return matching Traces
15.3 WHEN a user opens a Trace detail view, THE OSD_Plugin SHALL display the full Observation tree with parent-child hierarchy, timing waterfall, and input/output for each Observation
15.4 WHEN a user selects an Observation in the detail view, THE OSD_Plugin SHALL display the Observation fields: type, model, input, output, usage, cost, latency, and any attached Scores
15.5 WHEN a user searches Traces by input or output content, THE OSD_Plugin SHALL perform full-text search across Trace and Observation input/output fields
Requirement 16: Session Management
User Story: As an Application Developer, I want to group related traces into sessions so that I can analyze multi-turn conversations and user journeys.
Acceptance Criteria
16.1 WHEN a Trace is ingested with a sessionId, THE Agentic_AI_Eval_Platform SHALL associate the Trace with the corresponding Session
16.2 WHEN a user opens a Session detail view, THE OSD_Plugin SHALL display all Traces belonging to that Session in chronological order
16.3 WHEN Scores are attached to a sessionId, THE Agentic_AI_Eval_Platform SHALL store and query them as session-level Scores
16.4 WHEN a user lists Sessions, THE OSD_Plugin SHALL display aggregate metrics: trace count, total latency, total cost, and score summaries
Requirement 17: OpenSearch Dashboards Plugin Architecture
User Story: As a Platform Administrator, I want the evaluation UI built as proper OpenSearch Dashboards plugins so that the platform integrates natively with the OpenSearch ecosystem.
Acceptance Criteria
17.1 THE OSD_Plugin SHALL follow the OpenSearch Dashboards plugin architecture with proper plugin registration, navigation entries, and saved object types
17.2 THE OSD_Plugin SHALL use the OpenSearch Dashboards HTTP service for all backend API calls to OpenSearch indices
17.3 THE OSD_Plugin SHALL implement project-based access control so that users only see data belonging to their authorized projects
17.4 WHEN the OSD_Plugin is installed, THE OSD_Plugin SHALL register navigation entries for: Traces, Sessions, Eval Sets, Experiments, Annotation Queues, Scores, Evaluators, and Dashboards
17.5 THE OSD_Plugin SHALL use React components consistent with the OpenSearch Dashboards OUI (OpenSearch UI) component library
Requirement 18: Async Processing via OpenSearch Job Scheduler
User Story: As a Platform Administrator, I want asynchronous tasks like LLM-as-a-Judge evaluations to execute reliably via the OpenSearch Job Scheduler so that both online and offline evaluation processing integrates natively with the OpenSearch ecosystem without a custom worker service.
Acceptance Criteria
18.1 THE Job_Scheduler SHALL register scheduled jobs for pending evaluation tasks and execute them asynchronously via the OpenSearch Job Scheduler plugin
18.2 WHEN an LLM_Judge evaluation job is queued, THE Job_Scheduler SHALL pick up the job, execute the LLM call, and submit the resulting Score within the configured timeout
18.3 IF a Job_Scheduler job fails, THEN THE Job_Scheduler SHALL retry the job up to a configurable maximum retry count with exponential backoff
18.4 WHEN a Job_Scheduler job exceeds the maximum retry count, THE Job_Scheduler SHALL mark the job as failed and record the error for operator review
18.5 THE Job_Scheduler SHALL leverage the OpenSearch Job Scheduler's built-in distributed execution to support horizontal scaling so that multiple cluster nodes can process jobs concurrently without duplicate execution
18.6 THE Job_Scheduler SHALL record job execution metrics (queue depth, processing time, success rate, failure rate) in an OpenSearch Index for monitoring
18.7 THE Job_Scheduler SHALL support separate job queues or priority levels for Online_Evaluation jobs (low-latency, single-item) and Offline_Evaluation jobs (batch, high-throughput) so that online evaluations are not starved by large offline batch runs
18.8 WHEN an Experiment_Run triggers batch scoring for multiple Experiment_Run_Items, THE Job_Scheduler SHALL create individual evaluation jobs for each item and process them concurrently up to a configurable concurrency limit
Requirement 19: Single-Trace Visualizations (Trace Map and Debug Timeline)
User Story: As an Application Developer debugging a complex agentic flow, I want graph and timeline visualizations of a single trace so that I can understand execution structure, parallel branches, and performance bottlenecks.
Acceptance Criteria
19.1 WHEN a user opens a Trace detail view, THE OSD_Plugin SHALL provide a toggle to switch between the observation tree view (Req 15.3) and a Trace Map graph view
19.2 WHEN the Trace Map view is active, THE OSD_Plugin SHALL render a directed graph where nodes represent Observations and edges represent parent-child relationships derived from parentObservationId
19.3 THE Trace Map SHALL visually distinguish node types (span, generation, tool call, retrieval) using distinct icons or colors per observation type
19.4 WHEN a user hovers over a Trace Map node, THE OSD_Plugin SHALL display a tooltip with the Observation name, type, model, latency, cost, and score summary
19.5 WHEN a user clicks a Trace Map node, THE OSD_Plugin SHALL display the full Observation detail panel with input, output, usage, and attached Scores
19.6 THE Trace Map SHALL highlight the critical path (longest sequential chain of observations by cumulative latency) with a distinct visual indicator
19.7 WHEN a Trace contains parallel branches (multiple Observations sharing the same parentObservationId), THE Trace Map SHALL render them as parallel paths in the graph layout
19.8 WHEN a user opens the Debug Timeline view, THE OSD_Plugin SHALL render a waterfall/swim-lane visualization showing each Observation as a horizontal bar positioned by startTime and endTime, with concurrent observations on parallel lanes
19.9 THE Debug Timeline SHALL overlay token usage and cost metrics on each observation bar so that cost hotspots are visually identifiable
19.10 WHEN a user zooms into a time range on the Debug Timeline, THE OSD_Plugin SHALL re-render the visible observations at higher detail within the selected range
Requirement 20: Multi-Trace Analytics Visualizations (Agent Map and Agent Path)
User Story: As an Evaluation Engineer analyzing agentic application behavior at scale, I want aggregate topology and path visualizations across multiple traces so that I can understand system architecture, common execution patterns, and failure paths.
Acceptance Criteria
20.1 WHEN a user opens the Agent Map view at project level, THE OSD_Plugin SHALL render a directed graph where nodes represent distinct observation types or names (e.g., "Planner Agent", "Search Tool") and edges represent call relationships, aggregated across all traces matching the current filters
20.2 WHEN a user opens the Agent Map view at session level, THE OSD_Plugin SHALL render the same topology graph scoped to traces within the selected Session
20.3 THE Agent Map edges SHALL display aggregate metrics: call count, average latency, error rate, and average cost per edge
20.4 THE Agent Map nodes SHALL display aggregate metrics: total invocations, average latency, total cost, and average score (if scores are attached)
20.5 WHEN a user clicks an Agent Map edge, THE OSD_Plugin SHALL display a list of individual traces that traversed that edge, linked to their Trace detail views
20.6 WHEN a user applies filters (time range, environment, tags, score range) to the Agent Map, THE OSD_Plugin SHALL recompute the topology using only matching traces
20.7 WHEN a user opens the Agent Path (Sankey) view, THE OSD_Plugin SHALL render a Sankey diagram where each vertical band represents a step in the execution sequence (ordered by observation position in the trace), and flow width represents the number of traces that followed that path
20.8 THE Agent Path SHALL derive execution paths by extracting the ordered sequence of observation types per trace and aggregating identical paths into flow bands
20.9 WHEN a user filters the Agent Path by score range, THE OSD_Plugin SHALL highlight or isolate paths taken by traces within the selected score range (e.g., "show paths for low-scoring traces only")
20.10 WHEN a user clicks a flow segment in the Agent Path, THE OSD_Plugin SHALL display the list of traces that followed that specific path
20.11 THE Agent Path SHALL support both live production traces (filtered by time range) and Experiment_Run traces (filtered by Experiment_Run id)
20.12 THE OSD_Plugin SHALL compute all Agent Map and Agent Path aggregations using OpenSearch aggregation queries without requiring data export
Requirement 21: Built-in Deterministic Evaluators
User Story: As an Evaluation Engineer, I want to configure deterministic evaluators from the UI so that I can score traces and experiment results using programmatic criteria without writing custom code or consuming LLM tokens.
Acceptance Criteria
21.1 THE Agentic_AI_Eval_Platform SHALL provide the following built-in Deterministic_Evaluator types, each configurable from the OSD_Plugin without writing code:
| Evaluator | Score Type | Input | Description |
| --- | --- | --- | --- |
| Exact Match | BOOLEAN | output, expectedOutput | Returns true if output exactly matches expectedOutput (with optional case-insensitive and whitespace-normalized modes) |
| Contains | BOOLEAN | output, search string(s) | Returns true if output contains all specified substrings or matches a regex pattern |
| JSON Validity | BOOLEAN | output | Returns true if output is valid JSON |
| JSON Schema Conformance | BOOLEAN | output, JSON Schema | Returns true if output is valid JSON conforming to the provided schema |
| Regex Match | BOOLEAN | output, regex pattern | Returns true if output matches the provided regex pattern |
| Levenshtein Distance | NUMERIC | output, expectedOutput | Returns the normalized edit distance (0.0 to 1.0) between output and expectedOutput |
| Cosine Similarity | NUMERIC | output, expectedOutput | Returns the cosine similarity (0.0 to 1.0) between embedding vectors of output and expectedOutput, using a configurable embedding model |
| Latency Threshold | BOOLEAN | trace/observation latency, threshold | Returns true if latency is within the configured threshold |
| Cost Threshold | BOOLEAN | trace/observation cost, threshold | Returns true if cost is within the configured threshold |
| Token Count | NUMERIC | trace/observation usage | Returns the total token count from usage metadata |
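As one concrete example of the deterministic evaluators above, the Levenshtein Distance evaluator can be sketched as classic edit distance divided by the longer string's length, giving 0.0 for identical strings and 1.0 for maximally different ones. This normalization is one common convention; the spec fixes only the 0.0–1.0 range:

```python
def normalized_levenshtein(a: str, b: str) -> float:
    """Normalized edit distance in [0.0, 1.0]; 0.0 means identical strings."""
    if not a and not b:
        return 0.0
    # Wagner-Fischer dynamic programming over two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))
```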
21.2 WHEN a user creates a Deterministic_Evaluator, THE OSD_Plugin SHALL store it in an OpenSearch Index with fields: id, projectId, name, evaluatorType, configuration (type-specific parameters), evaluationMode (ONLINE or OFFLINE), target type (Trace or Observation), and scoreConfigId
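A hypothetical example of the 21.2 document shape, shown as a Python dict for illustration. The id values, the `JSON_SCHEMA_CONFORMANCE` type string, and the exact casing of field values are assumptions; only the field names come from the requirement:

```python
# Illustrative Deterministic_Evaluator document per 21.2 (values are made up).
evaluator_doc = {
    "id": "eval-7f3a",
    "projectId": "demo-project",
    "name": "answer-must-be-valid-json",
    "evaluatorType": "JSON_SCHEMA_CONFORMANCE",
    "configuration": {
        # Type-specific parameters; here, the schema the output must satisfy.
        "jsonSchema": {"type": "object", "required": ["answer"]},
    },
    "evaluationMode": "ONLINE",
    "targetType": "Trace",
    "scoreConfigId": "score-config-12",
}
```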
21.3 WHEN a Deterministic_Evaluator requires expectedOutput (Exact Match, Levenshtein Distance, Cosine Similarity), THE Agentic_AI_Eval_Platform SHALL enforce evaluationMode OFFLINE, SHALL allow assignment only to Experiment_Run scoring, and SHALL reject assignment to online trace-matching triggers
21.4 WHEN a Deterministic_Evaluator is triggered, THE Job_Scheduler SHALL execute the evaluator logic, compute the score, and submit the resulting Score document with source EVAL
21.5 WHEN the Cosine Similarity evaluator is used, THE Agentic_AI_Eval_Platform SHALL call a configurable embedding model (Amazon Bedrock or OpenAI embeddings) to generate vectors before computing similarity
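Once the embedding model has produced vectors, the similarity step in 21.5 is a plain dot-product computation. A minimal sketch; note that raw cosine lies in [-1, 1], so clamping or rescaling to the 0.0–1.0 range stated in 21.1 is an implementation choice this spec does not pin down:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Degenerate zero vectors get similarity 0.0 rather than a ZeroDivisionError.
    return dot / norm if norm else 0.0
```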
21.6 THE Agentic_AI_Eval_Platform SHALL allow Deterministic_Evaluators to be assigned to the same online trace-matching triggers and offline Experiment_Run scoring as LLM_Judge evaluators
21.7 WHEN a user configures an Experiment_Run or online trigger, THE OSD_Plugin SHALL allow selecting a mix of LLM_Judge Evaluator_Templates and Deterministic_Evaluators to run together
21.8 THE Deterministic_Evaluator execution SHALL NOT create an execution Trace (unlike LLM_Judge), since deterministic evaluations are lightweight and do not require debugging of evaluator reasoning
Future Enhancement: Support for user-defined custom evaluator functions (Python/TypeScript) that can be uploaded and executed server-side by the Job_Scheduler. This would enable arbitrary scoring logic beyond the built-in evaluator types. Deferred due to security implications of running user-provided code.
Requirement 22: RAG Evaluation via Ragas Framework
User Story: As an Evaluation Engineer building RAG applications, I want built-in RAG evaluation metrics based on the Ragas framework so that I can assess retrieval quality, answer faithfulness, and context relevance without implementing custom evaluators.
Acceptance Criteria
22.1 THE Agentic_AI_Eval_Platform SHALL extract RAG_Context from Traces by identifying retrieval-type Observations (type = "retrieval" or a configurable observation name filter) and collecting their output fields as the retrieved context documents
22.2 THE Agentic_AI_Eval_Platform SHALL provide the following built-in RAG evaluation metrics, each implemented as a pre-configured Evaluator_Template using LLM-as-a-Judge:
| Metric | Score Type | Inputs | evaluationMode | Description |
| --- | --- | --- | --- | --- |
| Faithfulness | NUMERIC (0.0–1.0) | answer, contexts | ONLINE or OFFLINE | Measures whether claims in the answer are supported by the retrieved contexts |
| Answer Relevancy | NUMERIC (0.0–1.0) | question, answer | ONLINE or OFFLINE | Measures whether the answer addresses the original question |
| Context Precision | NUMERIC (0.0–1.0) | question, contexts, ground_truth | OFFLINE only | Measures whether relevant contexts are ranked higher than irrelevant ones |
| Context Recall | NUMERIC (0.0–1.0) | contexts, ground_truth | OFFLINE only | Measures whether the retrieved contexts cover the information in the ground truth |
| Context Relevancy | NUMERIC (0.0–1.0) | question, contexts | ONLINE or OFFLINE | Measures whether the retrieved contexts are relevant to the question |
| Answer Correctness | NUMERIC (0.0–1.0) | answer, ground_truth | OFFLINE only | Measures factual overlap between the answer and the ground truth |
| Answer Similarity | NUMERIC (0.0–1.0) | answer, ground_truth | OFFLINE only | Measures semantic similarity between the answer and the ground truth using embeddings |
22.3 WHEN a user selects a RAG evaluation metric from the OSD_Plugin, THE OSD_Plugin SHALL display the metric's required inputs and automatically map them from the Trace structure: question from trace input, answer from trace output, contexts from RAG_Context observations, and ground_truth from Experiment expectedOutput (offline only)
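The automatic mapping in 22.3 can be sketched as a small adapter from the Trace schema to the metric's named inputs. The trace dict shape is an assumption based on the Trace fields in Requirement 1; `ground_truth` is attached only when offline Experiment data is available:

```python
def map_rag_inputs(trace: dict, contexts: list, expected_output=None) -> dict:
    """Map RAG metric inputs from the trace structure per 22.3."""
    inputs = {
        "question": trace.get("input"),    # question <- trace input
        "answer": trace.get("output"),     # answer   <- trace output
        "contexts": contexts,              # contexts <- RAG_Context observations
    }
    if expected_output is not None:
        # ground_truth comes from Experiment expectedOutput (offline only).
        inputs["ground_truth"] = expected_output
    return inputs
```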
22.4 THE Agentic_AI_Eval_Platform SHALL expose additional template variables {{contexts}} and {{question}} for RAG Evaluator_Templates, where {{contexts}} is the concatenated or structured list of RAG_Context documents and {{question}} is the trace input
22.5 WHEN a RAG metric requires ground_truth (Context Precision, Context Recall, Answer Correctness, Answer Similarity), THE Agentic_AI_Eval_Platform SHALL enforce evaluationMode OFFLINE and reject assignment to online trace-matching triggers
22.6 WHEN a RAG metric uses only question, answer, and contexts (Faithfulness, Answer Relevancy, Context Relevancy), THE Agentic_AI_Eval_Platform SHALL allow both ONLINE and OFFLINE evaluationMode
22.7 THE Agentic_AI_Eval_Platform SHALL allow users to customize the underlying LLM prompt for each RAG metric while preserving the metric's scoring logic and output schema
22.8 WHEN the Answer Similarity metric is used, THE Agentic_AI_Eval_Platform SHALL compute similarity using a configurable embedding model (Amazon Bedrock or OpenAI embeddings) rather than an LLM judge call
22.9 WHEN a user runs a RAG evaluation suite, THE OSD_Plugin SHALL allow selecting multiple RAG metrics to execute together and display results in a unified RAG evaluation dashboard showing all metric scores per trace or Experiment_Run_Item
22.10 THE Instrumentation_Library SHALL provide a helper method to tag retrieval observations with type "retrieval" and structure the output as a list of context documents, ensuring RAG_Context extraction works correctly
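The 22.10 helper might look something like the sketch below. The function name, signature, and document shape are hypothetical; the spec fixes only that the observation is tagged type "retrieval" and that its output is a structured list of context documents:

```python
# Hypothetical helper shape for 22.10; the real Instrumentation_Library API
# is not fixed by this spec.
def create_retrieval_observation(trace_id: str, name: str,
                                 documents: list[str]) -> dict:
    """Build an Observation tagged type="retrieval" whose output is a list of
    context documents, so RAG_Context extraction (22.1) can find it."""
    return {
        "traceId": trace_id,
        "type": "retrieval",
        "name": name,
        "output": [{"content": doc} for doc in documents],
    }
```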
22.11 WHEN RAG evaluation scores are displayed, THE OSD_Plugin SHALL show the extracted contexts alongside the scores so users can understand why a particular faithfulness or relevancy score was assigned
Requirements
Primary Issue: #2588