Evaluate standalone agents #150

@clareliguori

Description

I'd like to be able to build a standalone agent using the tau2 domain tools, then evaluate the standalone agent using tau2.

A few changes I can think of that would be involved:

  1. Separate the tools out into something like an MCP server.
  2. Remote agent invocation: The standalone agent could be invoked over A2A, similar to "Improving A2A Agent Integration for tau2-bench" (#111).
  3. Offline evaluation: Ability to convert the remote agent's state/messages into a format that tau2 can evaluate after a run.
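
To make item 3 more concrete, here is a rough sketch of what the conversion step might look like. Everything in it is an assumption for illustration: the event shape (`type`, `text`, `call_id`, `tool`, `args`, `result`), the `Message` and `ToolCall` classes, and the `lookup_order` tool name are hypothetical and are not tau2's actual message types or a real domain tool. The idea is only that a remote agent's event log gets normalized into an ordered list of user, assistant, and tool messages that an evaluator could read after the run.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation requested by the agent."""
    id: str
    name: str
    arguments: Dict[str, Any]


@dataclass
class Message:
    """Hypothetical neutral message format; not tau2's real schema."""
    role: str  # "user", "assistant", or "tool"
    content: Optional[str] = None
    tool_calls: List[ToolCall] = field(default_factory=list)
    tool_call_id: Optional[str] = None


def convert_transcript(raw_events: List[Dict[str, Any]]) -> List[Message]:
    """Map an assumed remote-agent event log onto an ordered message list."""
    messages: List[Message] = []
    for event in raw_events:
        kind = event["type"]
        if kind == "user_text":
            messages.append(Message(role="user", content=event["text"]))
        elif kind == "agent_text":
            messages.append(Message(role="assistant", content=event["text"]))
        elif kind == "tool_request":
            call = ToolCall(
                id=event["call_id"],
                name=event["tool"],
                arguments=event.get("args", {}),
            )
            messages.append(Message(role="assistant", tool_calls=[call]))
        elif kind == "tool_result":
            # Tool output is serialized so every message carries string content.
            messages.append(
                Message(
                    role="tool",
                    content=json.dumps(event["result"]),
                    tool_call_id=event["call_id"],
                )
            )
        else:
            raise ValueError(f"Unknown event type: {kind!r}")
    return messages


def to_json(messages: List[Message]) -> str:
    """Serialize the converted messages for an offline evaluation step."""
    return json.dumps([asdict(m) for m in messages], indent=2)


if __name__ == "__main__":
    # Made-up transcript from a hypothetical remote agent run.
    events = [
        {"type": "user_text", "text": "Where is my order?"},
        {"type": "tool_request", "call_id": "c1", "tool": "lookup_order",
         "args": {"order_id": "A123"}},
        {"type": "tool_result", "call_id": "c1", "result": {"status": "shipped"}},
        {"type": "agent_text", "text": "Your order has shipped."},
    ]
    print(to_json(convert_transcript(events)))
```

Keeping tool requests and tool results linked by a call ID is one way to let an evaluator check which tools were called and with what arguments, independent of how the remote agent was hosted.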
