[Design Proposal] Agent Manager Evaluations and AMP-Eval SDK #199
Replies: 7 comments 8 replies
-
There seem to be some inconsistencies in the UI that make the model a bit confusing. Are we creating Jobs or Evaluators here? The flow goes through evaluator config → agent/environment selection → schedule, which bundles both concepts together. If I want to run the same safety evaluator on multiple agents, do I need to go through this full flow each time and duplicate the evaluator config? Or is there a way to define an evaluator once and then create multiple jobs that reference it? It might be worth clarifying the entity model.
-
I'm trying to understand the persistence model for evaluation results. The proposal shows the API endpoints for submitting results, but I wasn't clear on how they're stored. Are evaluation results stored as annotations on the trace documents themselves, or in a separate store linked back to the trace? Also, are results linked back to the specific evaluation job that produced them?
-
The SDK examples show multiple evaluators registered in a single file. For example, what happens if two different files register evaluators with the same name? Is there any namespacing, or are names expected to be globally unique? Also, can built-in evaluators and custom evaluators be mixed in a single evaluation job?
-
Do we have a way to surface the cost of an evaluation? Since some evaluators rely on LLMs, it would be a useful metric to show how much cost is incurred when running an evaluation.
-
A few suggestions and best practices we could adopt when implementing the UI/UX. Accessibility and Inclusive Design: we recently received feedback highlighting the need for our product to better adhere to accessibility standards, specifically the WCAG guidelines. Please ensure these guidelines are properly followed during implementation. The Oxygen UI design system already provides and enforces several a11y best practices; please verify and handle them correctly. Key guidelines to keep in mind include:
Wizard / Stepper User Flow Guidelines
Guidance













-
Problem
AI agents are inherently non-deterministic—running the same prompt can produce different outputs, tools, and reasoning paths. This makes continuous evaluation essential for maintaining quality, detecting regressions, and building trust in production systems. Yet teams face three core challenges:
1. Evaluation Pipelines are Hard to Build and Manage
Setting up production-grade evaluation requires fetching observability data efficiently and running evaluators over it.
High Engineering Effort: Building production-grade pipelines to monitor agents requires substantial effort to solve data access, efficient batch processing, and error handling—diverting focus from core agent development.
Misplaced Focus: Even experienced teams spend weeks building infrastructure “plumbing” instead of writing evaluation logic.
Quality Blind Spots: Small teams often abandon automation and rely on manual spot-checks, leaving production systems unmonitored and risky.
2. Developer Workflows Break Down
Current approaches—like fully UI-configurable evaluators or evaluators embedded inside agent frameworks—violate standard software engineering practices.
Configuration Lock-in: UI-driven or platform-specific evals can’t be version-controlled, peer-reviewed, or audited via Git.
No Local Parity: With UI-configured evaluators, developers can’t run the exact same evaluation locally, leading to “deploy and pray” cycles.
Risk of Framework Coupling: Embedding evaluation logic directly in the agent’s implementation via the agent framework can degrade performance and propagate evaluator failures into the production agent.
3. Operational Misalignment
A gap exists between the developers who understand what quality metrics are important for a specific agent and the teams responsible for monitoring those metrics in production.
Knowledge Asymmetry: Developers know which behaviors to track and how to judge correctness, but platform owners managing SLAs often lack this context.
Integration Bottleneck: Even when developers build the logic, there is no standardized way to deploy it into production without manual, one-off work.
Existing Solutions
After analyzing the major platforms and frameworks, we collected the following findings. Each has strengths, but all fall short of addressing our three core problems: pipeline complexity, developer workflow, and operational misalignment.
Arize Phoenix
Open-source observability and evaluation platform for LLM applications. Allows creating evaluators and provides a UI for viewing traces and evaluation results. Supports two types of evaluation (online & offline).
Approach:
a. LLM-as-judge evaluators can be configured by modifying prompts
b. Python scripts with limited dependencies can be loaded as evaluators
Links:
LangSmith
Cloud platform for LLM application monitoring and evaluation (by LangChain)
Approach:
Links:
OpenLIT
Open-source observability tool for LLM applications with built-in evaluation
Approach:
Links:
Google Vertex AI Eval
Managed evaluation service for generative AI applications
Approach:
Links:
Proposed Solution
User Stories
Story 1: As a user, I need to monitor agent traces continuously to track performance trends and detect production issues in real time.
Story 2: As a user, I need to test my agent against a specific set of tasks to ensure it meets my success criteria before I deploy.
Story 3: As a user, I need to define agent-specific evaluators (using code or templates) to ensure the evaluation reflects my specific business logic.
Story 4: As a user, I need to create, version, and manage "gold" datasets to maintain a consistent standard for benchmarking.
(Out of current scope)
Story 5: As a user, I need to automate evaluations so that every update to the agent is automatically tested against a dataset to ensure quality remains consistent.
Story 6: As a user, I need to enable the annotation of traces and raw data to generate high-quality datasets for future experimentation.
Concepts
Agent: The AI application or system being evaluated. Evaluating "an agent" often means evaluating both the model and the code (the code that orchestrates it) working together.
Task: A single test or "problem" with a defined input and success criteria that needs to be tested.
Trial: A single attempt of a task. Multiple trials are often run per task to account for model non-determinism.
Trajectory: The step-by-step record of a trial. It includes model reasoning, tool calls, intermediate results, and outputs.
Outcome: The final state of the environment or "external reality" at the end of a trial (e.g., whether a booking was actually placed).
Evaluation: The systematic process of measuring and scoring an agent's performance to ensure quality and reliability.
Evaluators: Discrete logic or functions used to assess a specific aspect of an agent’s performance, such as accuracy, safety, or tone.
Aggregator: Mathematical functions used to summarize scores from multiple datapoints (tasks) and trials.
Dataset: A curated collection of tasks and their success criteria used as the "gold standard" for benchmarking agents.
Labelling: The process of manually or automatically annotating traces to create high-quality datasets for agent experimentation.
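To make the Task/Trial/Aggregator relationship above concrete, here is a minimal sketch (the names and shapes here are illustrative, not the SDK's actual API): each task is attempted several times to absorb model non-determinism, and an aggregator collapses the per-trial scores into a per-task metric.

```python
from statistics import mean

def aggregate_pass_rate(trial_scores: dict[str, list[float]]) -> dict[str, float]:
    """Collapse per-trial scores (0.0-1.0) into a mean score per task.

    Running several trials per task smooths out non-determinism: a task
    that passes 2 of 3 trials scores ~0.67 instead of being a coin flip.
    """
    return {task_id: mean(scores) for task_id, scores in trial_scores.items()}

# Three trials for "book-flight": the agent succeeded in two of them.
scores = {"book-flight": [1.0, 1.0, 0.0], "cancel-booking": [1.0]}
print(aggregate_pass_rate(scores))
```

A dataset-level score would then apply a second aggregation (e.g. mean over tasks) on top of these per-task values.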
What is Provided
We provide:
amp-evaluation SDK: A Python framework for writing portable evaluation logic that runs consistently across local development, CI/CD pipelines, and production monitoring.
Problem it solves:
Traditional evaluation workflows require maintaining separate codebases for local testing (CSV files, mock data) vs production monitoring (API clients, database queries). This creates duplicated evaluation logic and drift between what is tested locally and what actually runs in production.
Architecture:
Unified Execution Model: A single Evaluator interface that receives structured trace data regardless of source. The SDK abstracts data fetching—evaluators are pure functions that score traces, not data loaders.
Pluggable Data Sources: Transparent switching between file-based traces (local development) and the platform API (production) via configuration. No code changes needed.
Framework Interoperability: Wrap third-party evaluators (Ragas, DeepEval, etc.). Compose external metrics with custom logic without vendor lock-in.
Local-First Development: CLI runs evaluations against JSON trace files without platform dependencies. Enables fast iteration, unit testing, and CI validation before deployment.
Built-in Agent Evaluator Library: Standard evaluators (exact match, semantic similarity, latency checks) included. Reduces bootstrap time and provides reference implementations.
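The architecture bullets above can be sketched as a small hypothetical interface. None of these names (`Trace`, `Score`, `Evaluator`, `ExactMatch`) are confirmed as the real amp-evaluation API; this is only an assumption-laden illustration of "evaluators as pure functions over structured traces":

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Structured trace data handed to evaluators, regardless of source."""
    trace_id: str
    input: str
    output: str
    spans: list = field(default_factory=list)

@dataclass
class Score:
    name: str
    value: float
    passed: bool

class Evaluator(ABC):
    """Pure function over a trace: no data loading, no side effects."""
    name: str = "evaluator"

    @abstractmethod
    def evaluate(self, trace: Trace) -> Score: ...

class ExactMatch(Evaluator):
    """A built-in-style evaluator: compares the output to a reference answer."""
    name = "exact_match"

    def __init__(self, reference: str):
        self.reference = reference

    def evaluate(self, trace: Trace) -> Score:
        hit = trace.output.strip() == self.reference.strip()
        return Score(self.name, 1.0 if hit else 0.0, hit)

# The same evaluator runs over a local JSON trace file or the platform API;
# only the trace *source* changes, never the evaluator code.
print(ExactMatch("4").evaluate(Trace("t-1", input="2+2?", output="4")))
```

Wrapping a third-party metric (Ragas, DeepEval, etc.) would be another `Evaluator` subclass whose `evaluate` delegates to the external library.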
Key benefit: Developers write evaluation logic once, test it locally with fast feedback loops, then deploy the same code to production monitoring—eliminating environment-specific variations and maintenance overhead.
The SDK is the tool developers use to write the evaluation logic. It’s designed to be lightweight and environment-aware.
Evaluation Support at Platform
The platform supports the following abstractions:
1) Evaluators
Evaluators define the logic to assess specific aspects of an agent's performance (e.g. hallucination detection, tool trajectory validation). The platform provides a set of pre-built evaluators commonly used across scenarios, as well as the ability to create code-based evaluators using the amp-evaluation SDK, tailored to the specific needs of your agents.
2) Datasets
Users can upload and manage datasets that define a set of tasks and success criteria for an agent. Datasets can be uploaded in .json format (see Dataset JSON Schema) or .csv.
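For illustration, a minimal .json dataset that satisfies the required fields of the Dataset JSON Schema in the appendix (`name`, plus at least one task with `id` and `input`; everything else is optional) could look like this. The concrete values are invented for the example:

```json
{
  "name": "booking-agent-smoke",
  "version": "1.0",
  "schema_version": "1.0",
  "tasks": [
    {
      "id": "task-001",
      "input": "Book a one-way flight from CMB to SIN for next Monday.",
      "reference_trajectory": [
        { "tool": "search_flights", "args": { "from": "CMB", "to": "SIN" } }
      ],
      "success_criteria": "A booking is placed for the correct route and date."
    }
  ]
}
```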
3) Monitors
Monitors are workers that track live agents and assess their performance in real time. Users can create monitors using either pre-built evaluators or custom code-based evaluators.
4) Experiments
Experiments allow users to run evaluations on agents using a predefined set of tasks and compare results against success criteria. Users can run experiments with datasets and leverage multiple pre-built or custom code-based evaluators to assess agent performance.
User Experience
Story 1: Continuously monitor agents to track production performance
This view lists all existing monitors.
After clicking Create, select the evaluator type to use for the monitor.
You can either use built-in evaluators or define custom evaluators using code.
See:
Once the evaluators are loaded, you can complete creating the monitor.
Click on an existing monitor to see its evaluation results.
You can also click View Traces to analyze results against traces and spans.
Story 2: Evaluate an agent's performance against a specific set of tasks
This view lists all existing experiments.
First, select the dataset for the experiment to define the set of tasks.
Next, choose the evaluator type to use during the experiment. You can either use built-in evaluators or define custom evaluators using code:
See:
Once the evaluators are loaded, you can complete creating the experiment.
Click on an existing experiment to see its evaluation results.
This view shows all experiment runs. Select a specific run to analyze its results.
Add Pre-built Evaluators
(For both Experiments and Monitors)
To configure a pre-built evaluator, use the right panel that appears after clicking an evaluator to add it.
Add Code-based Evaluators
(For both Experiments and Monitors)

Note: UI mocks for Assets are WIP
Architecture
Originally Planned Architecture
Monitor Creation Flow:
The user triggers monitor creation through the Agent Manager console. We planned to have parameterized templates for CronWorkflow and ClusterWorkflow applied to the OpenChoreo cluster during the Agent Manager installation. The Agent Manager service would then invoke the OpenChoreo service to create a WorkflowRun referencing those templates, which would schedule the monitor workflow. All data and configs would be stored through OpenChoreo, eliminating the need to store monitor data in our own database. Only monitor results would be pushed to the database by the monitors after each run.
Challenge:
OpenChoreo doesn't support cron workflows yet. As an alternative, we attempted to use Argo CronWorkflows directly. However, this introduced a new problem: when the WorkflowRun is executed, it creates an Argo CronWorkflow in the build plane. That CronWorkflow's scheduler then creates the actual monitor workflows, which are not tracked by OpenChoreo. As a result, we would be unable to fetch the status, logs, or any data related to those workflows unless we accessed them through the Kubernetes API directly. This is also not a viable solution because the WSO2 Cloud team has made a design decision that we cannot directly access the Kubernetes API of another cluster from the control plane—all access must go through OpenChoreo.
Revised Architecture
After discussing the constraints with the OpenChoreo team, we decided to go with the above revised architecture until OpenChoreo supports creating generic scheduled workflows.
Key Changes:
The workflow template is applied as a plain Workflow when each WorkflowRun is applied—no cron scheduling at the Argo level.
The Agent Manager service owns the schedule itself, creating WorkflowRun resources through the OpenChoreo service to trigger the monitor workflows.
This approach keeps all workflow execution and observability within OpenChoreo's supported capabilities while giving us full control over scheduling and status tracking.
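The revised scheduling loop could be sketched as follows. This is only an assumption-heavy illustration: `Monitor` and `create_workflow_run` are hypothetical names, and the real implementation would call the OpenChoreo service rather than print; the point shown is that the Agent Manager, not Argo cron, decides when each WorkflowRun is created.

```python
import heapq
from dataclasses import dataclass

@dataclass(order=True)
class Monitor:
    next_run: float   # epoch seconds of the next due execution
    interval_s: int   # how often this monitor should run
    monitor_id: str

def create_workflow_run(monitor_id: str) -> None:
    """HYPOTHETICAL: stands in for a call to the OpenChoreo service that
    applies a WorkflowRun referencing the monitor's workflow template."""
    print(f"WorkflowRun created for monitor {monitor_id}")

def run_due_monitors(monitors: list[Monitor], until: float) -> list[str]:
    """Fire every monitor execution due before `until`; return fired IDs.

    Because scheduling lives in the Agent Manager rather than an Argo
    CronWorkflow, every execution is an OpenChoreo-tracked WorkflowRun
    whose status and logs remain queryable."""
    heapq.heapify(monitors)
    fired = []
    while monitors and monitors[0].next_run < until:
        m = heapq.heappop(monitors)
        create_workflow_run(m.monitor_id)
        fired.append(m.monitor_id)
        m.next_run += m.interval_s   # schedule the next execution
        heapq.heappush(monitors, m)
    return fired

# Two monitors over a 10-minute window: one every 60 s, one every 300 s.
runs = run_due_monitors(
    [Monitor(0.0, 60, "safety-monitor"), Monitor(0.0, 300, "latency-monitor")],
    until=600.0,
)
```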
Appendix
Dataset JSON Schema
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://wso2.com/schemas/agent-evaluation/dataset/v1.0",
  "title": "WSO2 Agent Evaluation Dataset Schema",
  "description": "Schema for defining evaluation datasets for AI agents. Supports both simple Q&A tasks and complex multi-step agent evaluations with trajectories, outcomes, and constraints.",
  "type": "object",
  "required": ["name", "tasks"],
  "properties": {
    "name": { "type": "string", "description": "Human-readable name of the dataset. Used for identification and reporting.", "minLength": 1 },
    "description": { "type": "string", "description": "Detailed description of the dataset's purpose, scope, and evaluation goals." },
    "version": { "type": "string", "description": "Semantic version of the dataset (e.g., '1.0.0'). Used to track dataset iterations and changes.", "pattern": "^\\d+\\.\\d+(\\.\\d+)?$" },
    "schema_version": { "type": "string", "description": "Version of this schema specification. Currently '1.0'. Allows for schema evolution.", "const": "1.0" },
    "metadata": {
      "type": "object",
      "description": "Dataset-level metadata for authorship, classification, and discovery.",
      "properties": {
        "created_by": { "type": "string", "description": "Author or team that created the dataset." },
        "created_at": { "type": "string", "format": "date-time", "description": "ISO 8601 timestamp when the dataset was created." },
        "domain": { "type": "string", "description": "Domain or industry vertical this dataset targets." },
        "tags": { "type": "array", "description": "Searchable tags for categorizing and filtering datasets.", "items": { "type": "string" } },
        "description": { "type": "string", "description": "Additional descriptive metadata (distinct from root-level description)." }
      },
      "additionalProperties": true
    },
    "defaults": {
      "type": "object",
      "description": "Default constraints and settings applied to all tasks unless overridden at task level.",
      "properties": {
        "max_latency_ms": { "type": "number", "description": "Default maximum acceptable latency in milliseconds.", "minimum": 0 },
        "max_tokens": { "type": "number", "description": "Default maximum token budget (input + output).", "minimum": 0 },
        "max_iterations": { "type": "number", "description": "Default maximum agent iterations/steps.", "minimum": 1 },
        "prohibited_content": { "type": "array", "description": "Default list of strings/patterns that should NOT appear in agent outputs.", "items": { "type": "string" } }
      },
      "additionalProperties": false
    },
    "tasks": {
      "type": "array",
      "description": "List of evaluation tasks/test cases. Each task represents one unit of evaluation.",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": ["id", "input"],
        "properties": {
          "id": { "type": "string", "description": "Unique identifier for the task within this dataset.", "minLength": 1 },
          "name": { "type": "string", "description": "Short human-readable name for the task." },
          "description": { "type": "string", "description": "Detailed description of what the task tests." },
          "input": { "type": "string", "description": "The user query/prompt that triggers the agent.", "minLength": 1 },
          "reference_output": { "type": "string", "description": "OPTIONAL: Expected final output/response from the agent." },
          "reference_trajectory": {
            "type": "array",
            "description": "OPTIONAL: Expected sequence of tool calls the agent should make.",
            "items": {
              "type": "object",
              "required": ["tool", "args"],
              "properties": {
                "tool": { "type": "string", "description": "Name of the tool/function that should be called." },
                "args": { "type": "object", "description": "Expected arguments passed to the tool." },
                "expected_output": { "type": "string", "description": "OPTIONAL: Expected output from this specific tool call." }
              },
              "additionalProperties": false
            }
          },
          "expected_outcome": { "type": "object", "description": "OPTIONAL: Expected side effects or state changes in the external environment. Free-form structure depending on outcome validators." },
          "success_criteria": { "type": "string", "description": "OPTIONAL: Human-readable description of what constitutes success." },
          "prohibited_content": { "type": "array", "description": "OPTIONAL: Task-specific list of strings/patterns that should NOT appear in the output.", "items": { "type": "string" } },
          "constraints": {
            "type": "object",
            "description": "OPTIONAL: Task-specific performance constraints. Overrides dataset defaults for this task only.",
            "properties": {
              "max_latency_ms": { "type": "number", "description": "Maximum acceptable latency.", "minimum": 0 },
              "max_tokens": { "type": "number", "description": "Maximum token budget.", "minimum": 0 },
              "max_iterations": { "type": "number", "description": "Maximum agent iterations.", "minimum": 1 }
            },
            "additionalProperties": false
          },
          "task_type": { "type": "string", "description": "OPTIONAL: Type/category of the task for grouping and specialized evaluators.", "enum": ["general", "qa", "code_gen", "rag", "tool_use", "math", "reasoning", "multi_step"] },
          "difficulty": { "type": "string", "description": "OPTIONAL: Subjective difficulty rating.", "enum": ["easy", "medium", "hard", "expert"], "default": "medium" },
          "domain": { "type": "string", "description": "OPTIONAL: Task-specific domain (can differ from dataset domain)." },
          "tags": { "type": "array", "description": "OPTIONAL: Task-specific tags for filtering and analysis.", "items": { "type": "string" } },
          "custom": { "type": "object", "description": "OPTIONAL: Arbitrary task-specific metadata. Can be empty or include any structure.", "additionalProperties": true },
          "metadata": {
            "type": "object",
            "description": "OPTIONAL: Task-level metadata for tracking, authorship, and review workflow.",
            "properties": {
              "author": { "type": "string", "description": "Who created this task." },
              "created_at": { "type": "string", "format": "date-time", "description": "When this task was created." },
              "reviewed_by": { "type": "string", "description": "Who reviewed/approved this task." },
              "last_updated": { "type": "string", "format": "date-time", "description": "When this task was last modified." }
            },
            "additionalProperties": true
          }
        },
        "additionalProperties": false
      }
    }
  },
  "additionalProperties": false
}
```
Out of Scope
What this proposal explicitly does NOT cover:
Alternatives Considered
❌ Less flexible for users
❌ Couples evaluation to platform
- Hard to support custom evaluators
- Deployment becomes monolithic
❌ No local development workflow
❌ Vendor lock-in
- Some users have data residency requirements
- Reduces flexibility
❌ Complex error handling
❌ Scales poorly with many evaluators
- Checkpoint management harder
- Network issues cause trace loss
❌ String comparison doesn't work
❌ No ordering guarantees
- WHERE trace_id > 'X' fails with UUIDs
- Can't use database indexes
- Not a standard pattern
❌ Impacts ingestion throughput
❌ Can't update evaluators without restart
- Evaluation is CPU-intensive
- Tight coupling of concerns
Open Questions
No response
Milestones
• Local CLI for running evaluations.
• UI for selecting pre-built evaluators.
• Monitor results from evaluation jobs.
• Versioning support for datasets.
• Golden Set curation tools (human-in-the-loop labeling from traces).