[Design Proposal] Agent Manager Evaluations and AMP-Eval SDK #199
Replies: 7 comments 8 replies
-
There seem to be some inconsistencies in the UI that make the model a bit confusing. Are we creating Jobs or Evaluators here? The flow goes through evaluator config → agent/environment selection → schedule, which bundles both concepts together. If I want to run the same safety evaluator on multiple agents, do I need to go through this full flow each time and duplicate the evaluator config? Or is there a way to define an evaluator once and then create multiple jobs that reference it? It might be worth clarifying the entity model.
-
I'm trying to understand the persistence model for evaluation results. The proposal shows the API endpoints for submitting results, but I wasn't clear on how they're stored. Are evaluation results stored as annotations on the trace documents themselves, or in a separate store linked back to the trace? Also, are results linked back to the specific evaluation job that produced them?
-
The SDK examples show multiple evaluators registered in a single file. For example, what happens if two different files register evaluators with the same name? Is there any namespacing, or are names expected to be globally unique? Also, can built-in evaluators and custom evaluators be mixed in a single evaluation job?
-
Do we have a way to surface the cost of an evaluation? Since some evaluators rely on LLMs, it would be a useful metric to show how much cost is incurred when running an evaluation.
-
A few suggestions and best practices we could adopt when implementing the UI/UX. Accessibility and Inclusive Design: we recently received feedback highlighting the need for our product to better adhere to accessibility standards, specifically the WCAG guidelines. Please ensure these guidelines are properly followed during implementation. The Oxygen UI design system already provides and enforces several a11y best practices; please verify and handle them correctly. Key guidelines to keep in mind include:
Wizard / Stepper User Flow Guidelines
Guidance













-
Problem
AI agents are inherently non-deterministic—running the same prompt can produce different outputs, tools, and reasoning paths. This makes continuous evaluation essential for maintaining quality, detecting regressions, and building trust in production systems. Yet teams face three core challenges:
1. Evaluation Pipelines are Hard to Build and Manage
Setting up production-grade evaluation requires fetching observability data efficiently and running evaluators over it.
High Engineering Effort: Building production-grade pipelines to monitor agents requires substantial effort to solve data access, efficient batch processing, and error handling—diverting focus from core agent development.
Misplaced Focus: Even experienced teams spend weeks building infrastructure “plumbing” instead of writing evaluation logic.
Quality Blind Spots: Small teams often abandon automation and rely on manual spot-checks, leaving production systems unmonitored and risky.
2. Developer Workflows Break Down
Current approaches—like fully UI-configurable evaluators or evaluators embedded inside agent frameworks—violate standard software engineering practices.
Configuration Lock-in: UI-driven or platform-specific evals can’t be version-controlled, peer-reviewed, or audited via Git.
No Local Parity: With UI-configured evaluators, developers can’t run the exact same evaluation locally, leading to “deploy and pray” cycles.
Risk of Framework Coupling: Embedding evaluation logic directly in the agent’s implementation via the agent framework can degrade performance and propagate evaluator failures into the production agent.
3. Operational Misalignment
A gap exists between the developers who understand what quality metrics are important for a specific agent and the teams responsible for monitoring those metrics in production.
Knowledge Asymmetry: Developers know which behaviors to track and how to judge correctness, but platform owners managing SLAs often lack this context.
Integration Bottleneck: Even when developers build the logic, there is no standardized way to deploy it into production without manual, one-off work.
Existing Solutions
After analyzing the major platforms and frameworks, we collected the following findings. Each has strengths, but all fall short of addressing our three core problems: pipeline complexity, developer workflow, and operational misalignment.
Arize Phoenix
Open-source observability and evaluation platform for LLM applications. Allows creating evaluators and provides a UI for viewing traces and evaluation results. Supports two types of evaluation (online & offline).
Approach:
a. LLM-as-judge evaluators can be configured by modifying prompts
b. Python scripts with limited dependencies can be loaded as evaluators
Links:
LangSmith
Cloud platform for LLM application monitoring and evaluation (by LangChain)
Approach:
Links:
OpenLIT
Open-source observability tool for LLM applications with built-in evaluation
Approach:
Links:
Google Vertex AI Eval
Managed evaluation service for generative AI applications
Approach:
Links:
Proposed Solution
User Stories
Story 1: As a user, I need to monitor agent traces continuously to track performance trends and detect production issues in real time.
Story 2: As a user, I need to test my agent against a specific set of tasks to ensure it meets my success criteria before I deploy.
Story 3: As a user, I need to define agent-specific evaluators (using code or templates) to ensure the evaluation reflects my specific business logic.
Story 4: As a user, I need to create, version, and manage "gold" datasets to maintain a consistent standard for benchmarking.
(Out of current scope)
Story 5: As a user, I need to automate evaluations so that every update to the agent is automatically tested against a dataset to ensure quality remains consistent.
Story 6: As a user, I need to enable the annotation of traces and raw data to generate high-quality datasets for future experimentation.
Concepts
Agent: The AI application or system being evaluated. Evaluating "an agent" often means evaluating both the model and the code (the code that orchestrates it) working together.
Task: A single test or "problem" with a defined input and success criteria that needs to be tested.
Trial: A single attempt of a task. Multiple trials are often run per task to account for model non-determinism.
Trajectory: The step-by-step record of a trial. It includes model reasoning, tool calls, intermediate results, and outputs.
Outcome: The final state of the environment or "external reality" at the end of a trial (e.g., whether a booking was actually placed).
Evaluation: The systematic process of measuring and scoring an agent's performance to ensure quality and reliability.
Evaluators: Discrete logic or functions used to assess a specific aspect of an agent’s performance, such as accuracy, safety, or tone.
Aggregator: Mathematical functions used to summarize scores from multiple datapoints (tasks) and trials.
Dataset: A curated collection of tasks and their success criteria used as the "gold standard" for benchmarking agents.
Labelling: The process of manually or automatically annotating traces to create high-quality datasets for agent experimentation.
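To make the Task/Trial/Aggregator relationship above concrete, here is a minimal sketch (the names and shapes here are illustrative, not the SDK's actual API): each task is attempted several times to absorb model non-determinism, and an aggregator collapses the per-trial scores into a per-task metric.

```python
from statistics import mean

def aggregate_pass_rate(trial_scores: dict[str, list[float]]) -> dict[str, float]:
    """Collapse per-trial scores (0.0-1.0) into a mean score per task.

    Running several trials per task smooths out non-determinism: a task
    that passes 2 of 3 trials scores ~0.67 instead of being a coin flip.
    """
    return {task_id: mean(scores) for task_id, scores in trial_scores.items()}

# Three trials for "book-flight": the agent succeeded in two of them.
scores = {"book-flight": [1.0, 1.0, 0.0], "cancel-booking": [1.0]}
print(aggregate_pass_rate(scores))
```

A dataset-level score would then apply a second aggregation (e.g. mean over tasks) on top of these per-task values.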
What is Provided
We provide:
amp-evaluation SDK: A Python framework for writing portable evaluation logic that runs consistently across local development, CI/CD pipelines, and production monitoring.
Problem it solves:
Traditional evaluation workflows require maintaining separate codebases for local testing (CSV files, mock data) vs production monitoring (API clients, database queries). This creates duplicated evaluation logic and drift between what is tested locally and what actually runs in production.
Architecture:
Unified Execution Model: A single Evaluator interface that receives structured trace data regardless of source. The SDK abstracts data fetching—evaluators are pure functions that score traces, not data loaders.
Pluggable Data Sources: Transparent switching between file-based traces (local development) and the platform API (production) via configuration. No code changes needed.
Framework Interoperability: Wrap third-party evaluators (Ragas, DeepEval, etc.). Compose external metrics with custom logic without vendor lock-in.
Local-First Development: CLI runs evaluations against JSON trace files without platform dependencies. Enables fast iteration, unit testing, and CI validation before deployment.
Built-in Agent Evaluator Library: Standard evaluators (exact match, semantic similarity, latency checks) included. Reduces bootstrap time and provides reference implementations.
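The architecture bullets above can be sketched as a small hypothetical interface. None of these names (`Trace`, `Score`, `Evaluator`, `ExactMatch`) are confirmed as the real amp-evaluation API; this is only an assumption-laden illustration of "evaluators as pure functions over structured traces":

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Structured trace data handed to evaluators, regardless of source."""
    trace_id: str
    input: str
    output: str
    spans: list = field(default_factory=list)

@dataclass
class Score:
    name: str
    value: float
    passed: bool

class Evaluator(ABC):
    """Pure function over a trace: no data loading, no side effects."""
    name: str = "evaluator"

    @abstractmethod
    def evaluate(self, trace: Trace) -> Score: ...

class ExactMatch(Evaluator):
    """A built-in-style evaluator: compares the output to a reference answer."""
    name = "exact_match"

    def __init__(self, reference: str):
        self.reference = reference

    def evaluate(self, trace: Trace) -> Score:
        hit = trace.output.strip() == self.reference.strip()
        return Score(self.name, 1.0 if hit else 0.0, hit)

# The same evaluator runs over a local JSON trace file or the platform API;
# only the trace *source* changes, never the evaluator code.
print(ExactMatch("4").evaluate(Trace("t-1", input="2+2?", output="4")))
```

Wrapping a third-party metric (Ragas, DeepEval, etc.) would be another `Evaluator` subclass whose `evaluate` delegates to the external library.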
Key benefit: Developers write evaluation logic once, test it locally with fast feedback loops, then deploy the same code to production monitoring—eliminating environment-specific variations and maintenance overhead.
The SDK is the tool developers use to write the evaluation logic. It’s designed to be lightweight and environment-aware.
Evaluation Support at Platform
The platform supports the following abstractions:
1) Evaluators
Evaluators define the logic to assess specific aspects of an agent's performance (e.g. hallucination detection, tool trajectory validation). The platform provides a set of pre-built evaluators commonly used across scenarios, as well as the ability to create code-based evaluators using the amp-evaluation SDK, tailored to the specific needs of your agents.
2) Datasets
Users can upload and manage datasets that define a set of tasks and success criteria for an agent. Datasets can be uploaded in .json format (see Dataset JSON Schema) or .csv.
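For illustration, a minimal .json dataset that satisfies the required fields of the Dataset JSON Schema in the appendix (`name`, plus at least one task with `id` and `input`; everything else is optional) could look like this. The concrete values are invented for the example:

```json
{
  "name": "booking-agent-smoke",
  "version": "1.0",
  "schema_version": "1.0",
  "tasks": [
    {
      "id": "task-001",
      "input": "Book a one-way flight from CMB to SIN for next Monday.",
      "reference_trajectory": [
        { "tool": "search_flights", "args": { "from": "CMB", "to": "SIN" } }
      ],
      "success_criteria": "A booking is placed for the correct route and date."
    }
  ]
}
```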
3) Monitors
Monitors are workers that track live agents and assess their performance in real time. Users can create monitors using either pre-built evaluators or custom code-based evaluators.
4) Experiments
Experiments allow users to run evaluations on agents using a predefined set of tasks and compare results against success criteria. Users can run experiments with datasets and leverage multiple pre-built or custom code-based evaluators to assess agent performance.
User Experience
Story 1: Continuously monitor agents to track production performance
This view lists all existing monitors.
After clicking Create, select the evaluator type to use for the monitor.
You can either use built-in evaluators or define custom evaluators using code.
See:
Once the evaluators are loaded, you can complete creating the monitor.
Click on an existing monitor to see its evaluation results.
You can also click View Traces to analyze results against traces and spans.
Story 2: Evaluate an agent's performance against a specific set of tasks
This view lists all existing experiments.
First, select the dataset for the experiment to define the set of tasks.
Next, choose the evaluator type to use during the experiment. You can either use built-in evaluators or define custom evaluators using code:
See:
Once the evaluators are loaded, you can complete creating the experiment.
Click on an existing experiment to see its evaluation results.
This view shows all experiment runs. Select a specific run to analyze its results.
Add Pre-built Evaluators
(For both Experiments and Monitors)
To configure a pre-built evaluator, use the right panel that appears after clicking an evaluator to add it.
Add Code-based Evaluators
(For both Experiments and Monitors)

Note: UI mocks for Assets are WIP
Architecture
Originally Planned Architecture
Monitor Creation Flow:
The user triggers monitor creation through the Agent Manager console. We planned to have parameterized templates for CronWorkflow and ClusterWorkflow applied to the OpenChoreo cluster during the Agent Manager installation. The Agent Manager service would then invoke the OpenChoreo service to create a WorkflowRun referencing those templates, which would schedule the monitor workflow. All data and configs would be stored through OpenChoreo, eliminating the need to store monitor data in our own database. Only monitor results would be pushed to the database by the monitors after each run.
Challenge:
OpenChoreo doesn't support cron workflows yet. As an alternative, we attempted to use Argo CronWorkflows directly. However, this introduced a new problem: when the WorkflowRun is executed, it creates an Argo CronWorkflow in the build plane. That CronWorkflow's scheduler then creates the actual monitor workflows, which are not tracked by OpenChoreo. As a result, we would be unable to fetch the status, logs, or any data related to those workflows unless we accessed them through the Kubernetes API directly. This is also not a viable solution because the WSO2 Cloud team has made a design decision that we cannot directly access the Kubernetes API of another cluster from the control plane—all access must go through OpenChoreo.
Revised Architecture
After discussing the constraints with the OpenChoreo team, we decided to go with the above revised architecture until OpenChoreo supports creating generic scheduled workflows.
Key Changes:
The workflow template is applied as a plain Workflow when each WorkflowRun is applied—no cron scheduling at the Argo level.
The Agent Manager service owns the schedule itself, creating WorkflowRun resources through the OpenChoreo service to trigger the monitor workflows.
This approach keeps all workflow execution and observability within OpenChoreo's supported capabilities while giving us full control over scheduling and status tracking.
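The revised scheduling loop could be sketched as follows. This is only an assumption-heavy illustration: `Monitor` and `create_workflow_run` are hypothetical names, and the real implementation would call the OpenChoreo service rather than print; the point shown is that the Agent Manager, not Argo cron, decides when each WorkflowRun is created.

```python
import heapq
from dataclasses import dataclass

@dataclass(order=True)
class Monitor:
    next_run: float   # epoch seconds of the next due execution
    interval_s: int   # how often this monitor should run
    monitor_id: str

def create_workflow_run(monitor_id: str) -> None:
    """HYPOTHETICAL: stands in for a call to the OpenChoreo service that
    applies a WorkflowRun referencing the monitor's workflow template."""
    print(f"WorkflowRun created for monitor {monitor_id}")

def run_due_monitors(monitors: list[Monitor], until: float) -> list[str]:
    """Fire every monitor execution due before `until`; return fired IDs.

    Because scheduling lives in the Agent Manager rather than an Argo
    CronWorkflow, every execution is an OpenChoreo-tracked WorkflowRun
    whose status and logs remain queryable."""
    heapq.heapify(monitors)
    fired = []
    while monitors and monitors[0].next_run < until:
        m = heapq.heappop(monitors)
        create_workflow_run(m.monitor_id)
        fired.append(m.monitor_id)
        m.next_run += m.interval_s   # schedule the next execution
        heapq.heappush(monitors, m)
    return fired

# Two monitors over a 10-minute window: one every 60 s, one every 300 s.
runs = run_due_monitors(
    [Monitor(0.0, 60, "safety-monitor"), Monitor(0.0, 300, "latency-monitor")],
    until=600.0,
)
```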
Appendix
Dataset JSON Schema
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://wso2.com/schemas/agent-evaluation/dataset/v1.0",
  "title": "WSO2 Agent Evaluation Dataset Schema",
  "description": "Schema for defining evaluation datasets for AI agents. Supports both simple Q&A tasks and complex multi-step agent evaluations with trajectories, outcomes, and constraints.",
  "type": "object",
  "required": ["name", "tasks"],
  "properties": {
    "name": { "type": "string", "description": "Human-readable name of the dataset. Used for identification and reporting.", "minLength": 1 },
    "description": { "type": "string", "description": "Detailed description of the dataset's purpose, scope, and evaluation goals." },
    "version": { "type": "string", "description": "Semantic version of the dataset (e.g., '1.0.0'). Used to track dataset iterations and changes.", "pattern": "^\\d+\\.\\d+(\\.\\d+)?$" },
    "schema_version": { "type": "string", "description": "Version of this schema specification. Currently '1.0'. Allows for schema evolution.", "const": "1.0" },
    "metadata": {
      "type": "object",
      "description": "Dataset-level metadata for authorship, classification, and discovery.",
      "properties": {
        "created_by": { "type": "string", "description": "Author or team that created the dataset." },
        "created_at": { "type": "string", "format": "date-time", "description": "ISO 8601 timestamp when the dataset was created." },
        "domain": { "type": "string", "description": "Domain or industry vertical this dataset targets." },
        "tags": { "type": "array", "description": "Searchable tags for categorizing and filtering datasets.", "items": { "type": "string" } },
        "description": { "type": "string", "description": "Additional descriptive metadata (distinct from root-level description)." }
      },
      "additionalProperties": true
    },
    "defaults": {
      "type": "object",
      "description": "Default constraints and settings applied to all tasks unless overridden at task level.",
      "properties": {
        "max_latency_ms": { "type": "number", "description": "Default maximum acceptable latency in milliseconds.", "minimum": 0 },
        "max_tokens": { "type": "number", "description": "Default maximum token budget (input + output).", "minimum": 0 },
        "max_iterations": { "type": "number", "description": "Default maximum agent iterations/steps.", "minimum": 1 },
        "prohibited_content": { "type": "array", "description": "Default list of strings/patterns that should NOT appear in agent outputs.", "items": { "type": "string" } }
      },
      "additionalProperties": false
    },
    "tasks": {
      "type": "array",
      "description": "List of evaluation tasks/test cases. Each task represents one unit of evaluation.",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": ["id", "input"],
        "properties": {
          "id": { "type": "string", "description": "Unique identifier for the task within this dataset.", "minLength": 1 },
          "name": { "type": "string", "description": "Short human-readable name for the task." },
          "description": { "type": "string", "description": "Detailed description of what the task tests." },
          "input": { "type": "string", "description": "The user query/prompt that triggers the agent.", "minLength": 1 },
          "reference_output": { "type": "string", "description": "OPTIONAL: Expected final output/response from the agent." },
          "reference_trajectory": {
            "type": "array",
            "description": "OPTIONAL: Expected sequence of tool calls the agent should make.",
            "items": {
              "type": "object",
              "required": ["tool", "args"],
              "properties": {
                "tool": { "type": "string", "description": "Name of the tool/function that should be called." },
                "args": { "type": "object", "description": "Expected arguments passed to the tool." },
                "expected_output": { "type": "string", "description": "OPTIONAL: Expected output from this specific tool call." }
              },
              "additionalProperties": false
            }
          },
          "expected_outcome": { "type": "object", "description": "OPTIONAL: Expected side effects or state changes in the external environment. Free-form structure depending on outcome validators." },
          "success_criteria": { "type": "string", "description": "OPTIONAL: Human-readable description of what constitutes success." },
          "prohibited_content": { "type": "array", "description": "OPTIONAL: Task-specific list of strings/patterns that should NOT appear in the output.", "items": { "type": "string" } },
          "constraints": {
            "type": "object",
            "description": "OPTIONAL: Task-specific performance constraints. Overrides dataset defaults for this task only.",
            "properties": {
              "max_latency_ms": { "type": "number", "description": "Maximum acceptable latency.", "minimum": 0 },
              "max_tokens": { "type": "number", "description": "Maximum token budget.", "minimum": 0 },
              "max_iterations": { "type": "number", "description": "Maximum agent iterations.", "minimum": 1 }
            },
            "additionalProperties": false
          },
          "task_type": { "type": "string", "description": "OPTIONAL: Type/category of the task for grouping and specialized evaluators.", "enum": ["general", "qa", "code_gen", "rag", "tool_use", "math", "reasoning", "multi_step"] },
          "difficulty": { "type": "string", "description": "OPTIONAL: Subjective difficulty rating.", "enum": ["easy", "medium", "hard", "expert"], "default": "medium" },
          "domain": { "type": "string", "description": "OPTIONAL: Task-specific domain (can differ from dataset domain)." },
          "tags": { "type": "array", "description": "OPTIONAL: Task-specific tags for filtering and analysis.", "items": { "type": "string" } },
          "custom": { "type": "object", "description": "OPTIONAL: Arbitrary task-specific metadata. Can be empty or include any structure.", "additionalProperties": true },
          "metadata": {
            "type": "object",
            "description": "OPTIONAL: Task-level metadata for tracking, authorship, and review workflow.",
            "properties": {
              "author": { "type": "string", "description": "Who created this task." },
              "created_at": { "type": "string", "format": "date-time", "description": "When this task was created." },
              "reviewed_by": { "type": "string", "description": "Who reviewed/approved this task." },
              "last_updated": { "type": "string", "format": "date-time", "description": "When this task was last modified." }
            },
            "additionalProperties": true
          }
        },
        "additionalProperties": false
      }
    }
  },
  "additionalProperties": false
}
```
Out of Scope
What this proposal explicitly does NOT cover:
Alternatives Considered
❌ Less flexible for users
❌ Couples evaluation to platform
- Hard to support custom evaluators
- Deployment becomes monolithic
❌ No local development workflow
❌ Vendor lock-in
- Some users have data residency requirements
- Reduces flexibility
❌ Complex error handling
❌ Scales poorly with many evaluators
- Checkpoint management harder
- Network issues cause trace loss
❌ String comparison doesn't work
❌ No ordering guarantees
- WHERE trace_id > 'X' fails with UUIDs
- Can't use database indexes
- Not a standard pattern
❌ Impacts ingestion throughput
❌ Can't update evaluators without restart
- Evaluation is CPU-intensive
- Tight coupling of concerns
Open Questions
No response
Milestones
• Local CLI for running evaluations.
• UI for selecting pre-built evaluators.
• Monitor results from evaluation jobs.
• Versioning support for datasets.
• Golden Set curation tools (human-in-the-loop labeling from traces).