Description
Problem Statement
AFAIK we don't have a pattern for: "I'm not sure which approach will perform best on a task beforehand, so try several at the same time and let a judge decide."
The Agent-as-a-Judge paradigm (which builds on the more common LLM-as-a-Judge) would fit naturally here.
Proposed Solution
from strands.multiagent import Arena
from strands import Agent

judge_agent = Agent(
    system_prompt="""You are a judge. Evaluate the solutions provided and pick the best one.
    Use your tools to verify claims, run code, check facts.
    Return your verdict with reasoning.""",
    tools=[run_code, verify_facts]
)

result = Arena(
    agents=[agent_a, agent_b, agent_c],
    judge=judge_agent,
).run("Design an API for user authentication")

Agents run in parallel and the judge evaluates, making the winner "emerge".
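For comparison, here is a rough sketch of how the same pattern could be approximated today with the existing Agent primitive and standard-library concurrency, reusing judge_agent and the candidate agents from the snippet above (agent_a/agent_b/agent_c are placeholders, and calling an Agent with a prompt string is assumed to return a printable result):

from concurrent.futures import ThreadPoolExecutor

candidates = {"agent_a": agent_a, "agent_b": agent_b, "agent_c": agent_c}
task = "Design an API for user authentication"

# Run every candidate on the same task in parallel.
with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
    futures = {name: pool.submit(agent, task) for name, agent in candidates.items()}
    solutions = {name: str(future.result()) for name, future in futures.items()}

# Hand all solutions to the judge in a single prompt and let it pick a winner.
verdict_prompt = "Task:\n" + task + "\n\nCandidate solutions:\n" + "\n\n".join(
    f"--- {name} ---\n{solution}" for name, solution in solutions.items()
)
verdict = judge_agent(verdict_prompt)
print(verdict)

A built-in Arena could hide this boilerplate and return a structured verdict (winner, reasoning, per-agent traces) instead of plain text.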
Use Case
I have a few different agent/multi-agent configurations and I want to know which one works best: I'm trying different prompts, comparing models, testing whether adding a tool actually helps, etc. I don't know beforehand which one will perform better on the task, so I want to run them all and pick the one that performs best.
The "Judge" agent can verify the outputs and pick the one it considers best. Because it is an agent, it can use tools (and all the other agent functionality) to validate results rather than just comparing text.
Alternative Solutions
No response
Additional Context
Agent-as-a-Judge: Evaluate Agents with Agents
When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs