Evaluate standalone agents

I'd like to be able to build a standalone agent using the tau2 domain tools, then evaluate the standalone agent using tau2.

A few changes I can think of that would be involved:
1. Separate the tools out into something like an MCP server.
1. Remote agent invocation: Similar to https://github.com/sierra-research/tau2-bench/issues/111, it could be invoked over A2A.
1. Offline evaluation: Ability to convert the remote agent's state/messages into a format that tau2 can evaluate after a run.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluate standalone agents #150

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Evaluate standalone agents #150

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions