Description
I want to be able to pass the tool calls into the LLM as a judge.
Sometimes, the output is only correct if a specific tool is used.
Current State
Your evaluation system currently:
- Runs the agent via replay_message_function
- Captures the agent's text response
- Passes only the text to LLM judges for evaluation
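For reference, the current text-only flow looks roughly like the following minimal sketch (the task function, case contents, and rubric are placeholders for your actual code):

```python
# Minimal sketch of the current text-only flow (illustrative; the task function,
# case contents, and rubric are placeholders for your actual code).
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge


async def replay_message(message: str) -> str:
    # Stand-in for replay_message_function: run the agent, return only its text response.
    return "Sure, I can set up a tour. What day works best for you?"


dataset = Dataset(
    cases=[Case(name='tour_request', inputs='Can I tour 123 Main St on Saturday?')],
    evaluators=[LLMJudge(rubric='The response should be professional and actionable.')],
)

# The judge only ever sees the text returned by the task function.
report = dataset.evaluate_sync(replay_message)
```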
What's Already Available
The good news is that tool calls are already being captured in the span tree when you use dataset.evaluate(). The EvaluatorContext passed to each evaluator contains ctx.span_tree with all tool call information.
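To sanity-check this in your setup, a throwaway evaluator like the sketch below can be added to a dataset to count the tool-call spans recorded per case (the 'running tool' span name matches what the plan below queries for and may differ depending on your instrumentation):

```python
# Throwaway evaluator: count 'running tool' spans to confirm tool calls are
# recorded in the span tree. Requires OTel/Logfire instrumentation to be active
# during dataset.evaluate(); otherwise accessing ctx.span_tree raises an error.
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext
from pydantic_evals.otel.span_tree import SpanQuery


@dataclass
class CountToolCalls(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> int:
        tool_spans = ctx.span_tree.find(SpanQuery(name_equals='running tool'))
        return len(tool_spans)
```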
Modifications Needed
To pass tool calls to the LLM judges, you need to:
- Create a custom evaluator that extracts tool calls from the span tree and formats them for the LLM judge
- Modify the LLM judge rubrics to include tool call evaluation criteria
- Update the judge creation to use both the agent output and tool call information
Here's the implementation plan:
- Create a Tool-Aware LLM Judge Wrapper
```python
# In lib/judges.py, add:
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext, EvaluationReason, LLMJudge
from pydantic_evals.otel.span_tree import SpanQuery


@dataclass
class ToolAwareLLMJudge(Evaluator):
    """LLM Judge that includes tool call information in its evaluation."""

    base_judge: LLMJudge
    include_tool_details: bool = True

    async def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        # Extract tool calls from the span tree
        tool_calls = []
        try:
            tool_spans = ctx.span_tree.find(
                SpanQuery(name_equals='running tool')
            )
            for span in tool_spans:
                tool_info = {
                    'name': span.attributes.get('gen_ai.tool.name', 'unknown'),
                    'parameters': span.attributes.get('gen_ai.tool.parameters', {}),
                    'duration': span.duration.total_seconds(),
                }
                tool_calls.append(tool_info)
        except Exception:
            # If the span tree is not available, proceed without tool info
            pass

        # Format tool call information
        tool_summary = self._format_tool_calls(tool_calls)

        # Create enhanced output for the judge
        enhanced_output = f"""
Agent Response:
{ctx.output}

Tool Calls Made:
{tool_summary if tool_summary else "No tools were called"}
"""

        # Create a modified context with the enhanced output
        modified_ctx = EvaluatorContext(
            name=ctx.name,
            inputs=ctx.inputs,
            metadata=ctx.metadata,
            expected_output=ctx.expected_output,
            output=enhanced_output,
            duration=ctx.duration,
            _span_tree=ctx._span_tree,
            attributes=ctx.attributes,
            metrics=ctx.metrics,
        )

        # Run the base judge with the enhanced context
        return await self.base_judge.evaluate(modified_ctx)

    def _format_tool_calls(self, tool_calls: list[dict]) -> str:
        if not tool_calls:
            return ""
        lines = []
        for i, call in enumerate(tool_calls, 1):
            lines.append(f"{i}. Tool: {call['name']}")
            if self.include_tool_details and call.get('parameters'):
                lines.append(f"   Parameters: {call['parameters']}")
            lines.append(f"   Duration: {call['duration']:.3f}s")
        return "\n".join(lines)
```
- Update Judge Creation Functions
```python
# Modify lib/judges.py functions:
def create_tool_aware_primary_judge() -> ToolAwareLLMJudge:
    """Create a tool-aware primary judge."""
    base_judge = LLMJudge(
        rubric="""
You are evaluating an AI agent's response to a real estate client's message.
Rate the response based on the specific rubric provided for this message category.

IMPORTANT: You are also provided with information about any tools the agent called.

Consider:
- Appropriateness of the response for the given message type
- Whether appropriate tools were used (if any were needed)
- Professional communication quality
- Helpfulness and actionability
- Client service excellence
- Accuracy and relevance

Tool usage expectations:
- The agent should use tools when they would help provide better service
- Tool calls should be relevant to the user's request
- The agent should handle tool results appropriately in the response

Provide a score from 1-5 and explain your reasoning based on both the response content and tool usage.
""",
        model="anthropic:claude-4-sonnet-20250514",
    )
    return ToolAwareLLMJudge(base_judge=base_judge)
```
- Update Dataset Builder
```python
# In lib/dataset_builder.py, modify get_judges_for_case:
def get_judges_for_case(category: str, include_tools: bool = True) -> list[Evaluator]:
    """Get all relevant judges for a specific case category."""
    if include_tools:
        judges = [
            create_tool_aware_primary_judge(),
            # Add other tool-aware judges...
        ]
    else:
        # Fall back to the original judges
        judges = [
            create_primary_judge(),
            create_professionalism_judge(),
            create_helpfulness_judge(),
            create_accuracy_judge(),
        ]

    # Add a category-specific judge if available
    category_judges = create_category_specific_judges()
    if category in category_judges:
        if include_tools:
            judges.append(ToolAwareLLMJudge(base_judge=category_judges[category]))
        else:
            judges.append(category_judges[category])

    return judges
```
- Add Tool Call Validation Evaluators
```python
# In lib/judges.py, add specific tool call evaluators:
from pydantic_evals.evaluators.common import HasMatchingSpan


def create_tool_call_validators(expected_tools: list[str]) -> list[Evaluator]:
    """Create evaluators that check for specific tool calls."""
    validators = []
    for tool_name in expected_tools:
        validators.append(
            HasMatchingSpan(
                query=SpanQuery(
                    name_equals='running tool',
                    has_attributes={'gen_ai.tool.name': tool_name},
                ),
                evaluation_name=f"called_{tool_name}",
            )
        )
    return validators


# Use in the dataset builder for specific categories:
def get_judges_for_case(category: str) -> list[Evaluator]:
    judges = [create_tool_aware_primary_judge()]

    # Add tool validators for categories that should use specific tools
    if category == "direct_tour_requests":
        judges.extend(create_tool_call_validators([
            'schedule_tour',
            'check_availability',
            'get_property_details',
        ]))

    return judges
```
Benefits of This Approach
- No changes needed to the core evaluation flow: tool calls are already captured
- LLM judges get full context: both the response text and tool usage
- Flexible validation: you can check for specific tools or just evaluate overall usage
- Backward compatible: evaluation can still run without tool information if needed
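As a rough sketch of how the pieces above could be wired together (the import path for replay_message_function is hypothetical, and the case contents are placeholders):

```python
# Sketch of running an evaluation with the tool-aware judges defined above.
from pydantic_evals import Case, Dataset

from lib.dataset_builder import get_judges_for_case
from lib.agent import replay_message_function  # hypothetical import path

category = 'direct_tour_requests'

dataset = Dataset(
    cases=[
        Case(
            name='tour_request_1',
            inputs='Can I see the condo at 123 Main St this weekend?',
            metadata={'category': category},
        ),
    ],
    # or get_judges_for_case(category) if you adopt the validator variant above
    evaluators=get_judges_for_case(category, include_tools=True),
)

report = dataset.evaluate_sync(replay_message_function)
report.print()
```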
Summary
The key insight is that tool calls are already being captured in the span tree during evaluation. You just need to:
- Extract them in a custom evaluator wrapper
- Format them for the LLM judge
- Update rubrics to consider tool usage
This gives your LLM judges full visibility into both what the agent said AND what tools it used, enabling much more comprehensive evaluation of agent behavior.
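For concreteness, with the wrapper above the base judge would receive an output string along these lines (the values are purely illustrative):

```text
Agent Response:
Happy to set up a tour of 123 Main St. Does Saturday at 2pm work for you?

Tool Calls Made:
1. Tool: check_availability
   Parameters: {'property_id': '123-main-st', 'date': '2025-06-14'}
   Duration: 0.412s
2. Tool: schedule_tour
   Parameters: {'property_id': '123-main-st', 'time': '14:00'}
   Duration: 0.688s
```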
References
No response