Passing in tools into the Evals LLM judge #2667

@ryx2

I want to be able to pass the tool calls into the LLM-as-a-judge evaluator.

Sometimes, the output is only correct if a specific tool is used.

Description

Current State

Your evaluation system currently:

  1. Runs the agent via replay_message_function
  2. Captures the agent's text response
  3. Passes only the text to LLM judges for evaluation
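
For reference, here is a minimal sketch of that text-only flow. Case, Dataset, LLMJudge, and evaluate_sync come from pydantic_evals; replay_message_function is your existing task function, and the example case input here is made up:

  from pydantic_evals import Case, Dataset
  from pydantic_evals.evaluators import LLMJudge

  dataset = Dataset(
      cases=[
          Case(
              name='tour_request_basic',
              inputs='Can I tour the condo at 123 Main St this Saturday?',
          ),
      ],
      evaluators=[
          # The judge only ever sees ctx.output, i.e. the agent's text response
          LLMJudge(rubric='The response should address the tour request professionally.'),
      ],
  )

  report = dataset.evaluate_sync(replay_message_function)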

What's Already Available

The good news is that tool calls are already being captured in the span tree when you use dataset.evaluate(). The EvaluatorContext passed to evaluators contains ctx.span_tree with all tool call information.
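
For example, a bare-bones custom evaluator (a sketch only, the UsedAnyTool name is hypothetical and not part of your codebase) can already see which tools ran, without any changes to the evaluation flow:

  from dataclasses import dataclass

  from pydantic_evals.evaluators import Evaluator, EvaluatorContext
  from pydantic_evals.otel.span_tree import SpanQuery

  @dataclass
  class UsedAnyTool(Evaluator):
      """Passes if the agent called at least one tool while handling the case."""

      def evaluate(self, ctx: EvaluatorContext) -> bool:
          # pydantic-ai records each tool execution as a 'running tool' span
          tool_spans = ctx.span_tree.find(SpanQuery(name_equals='running tool'))
          return len(tool_spans) > 0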

Modifications Needed

To pass tool calls to the LLM judges, you need to:

  1. Create a custom evaluator that extracts tool calls from the span tree and formats them for the LLM judge
  2. Modify the LLM judge rubrics to include tool call evaluation criteria
  3. Update the judge creation to use both the agent output and tool call information

Here's the implementation plan:

  1. Create a Tool-Aware LLM Judge Wrapper
  # In lib/judges.py, add:

  from dataclasses import dataclass
  from pydantic_evals.evaluators import Evaluator, EvaluatorContext, EvaluationReason, LLMJudge
  from pydantic_evals.otel.span_tree import SpanQuery

  @dataclass
  class ToolAwareLLMJudge(Evaluator):
      """LLM Judge that includes tool call information in its evaluation."""

      base_judge: LLMJudge
      include_tool_details: bool = True

      async def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
          # Extract tool calls from span tree
          tool_calls = []
          try:
              tool_spans = ctx.span_tree.find(
                  SpanQuery(name_equals='running tool')
              )
              for span in tool_spans:
                  tool_info = {
                      'name': span.attributes.get('gen_ai.tool.name', 'unknown'),
                      'parameters': span.attributes.get('gen_ai.tool.parameters', {}),
                      'duration': span.duration.total_seconds()
                  }
                  tool_calls.append(tool_info)
          except Exception:
              # If span tree not available, proceed without tool info
              pass

          # Format tool call information
          tool_summary = self._format_tool_calls(tool_calls)

          # Create enhanced output for the judge
          enhanced_output = f"""
  Agent Response:
  {ctx.output}

  Tool Calls Made:
  {tool_summary if tool_summary else "No tools were called"}
  """

          # Rebuild the context so the judge sees the enhanced output. Note that
          # this copies the private _span_tree field directly, so it may need
          # adjusting if EvaluatorContext's fields change in a future release.
          modified_ctx = EvaluatorContext(
              name=ctx.name,
              inputs=ctx.inputs,
              metadata=ctx.metadata,
              expected_output=ctx.expected_output,
              output=enhanced_output,
              duration=ctx.duration,
              _span_tree=ctx._span_tree,
              attributes=ctx.attributes,
              metrics=ctx.metrics
          )

          # Run the base judge with enhanced context
          return await self.base_judge.evaluate(modified_ctx)

      def _format_tool_calls(self, tool_calls: list[dict]) -> str:
          if not tool_calls:
              return ""

          lines = []
          for i, call in enumerate(tool_calls, 1):
              lines.append(f"{i}. Tool: {call['name']}")
              if self.include_tool_details and call.get('parameters'):
                  lines.append(f"   Parameters: {call['parameters']}")
              lines.append(f"   Duration: {call['duration']:.3f}s")

          return "\n".join(lines)
  2. Update Judge Creation Functions
  # Modify lib/judges.py functions:

  def create_tool_aware_primary_judge() -> ToolAwareLLMJudge:
      """Create a tool-aware primary judge."""
      base_judge = LLMJudge(
          rubric="""
  You are evaluating an AI agent's response to a real estate client's message. 
  Rate the response based on the specific rubric provided for this message category.

  IMPORTANT: You are also provided with information about any tools the agent called.
  Consider:
  - Appropriateness of the response for the given message type
  - Whether appropriate tools were used (if any were needed)
  - Professional communication quality
  - Helpfulness and actionability
  - Client service excellence
  - Accuracy and relevance

  Tool usage expectations:
  - The agent should use tools when they would help provide better service
  - Tool calls should be relevant to the user's request
  - The agent should handle tool results appropriately in the response

  Provide a score from 1-5 and explain your reasoning based on both the response content and tool usage.
          """,
          model="anthropic:claude-4-sonnet-20250514",
      )
      return ToolAwareLLMJudge(base_judge=base_judge)
  3. Update Dataset Builder
  # In lib/dataset_builder.py, modify get_judges_for_case:

  def get_judges_for_case(category: str, include_tools: bool = True) -> list[Evaluator]:
      """Get all relevant judges for a specific case category."""
      if include_tools:
          judges = [
              create_tool_aware_primary_judge(),
              # Add other tool-aware judges...
          ]
      else:
          # Fallback to original judges
          judges = [
              create_primary_judge(),
              create_professionalism_judge(),
              create_helpfulness_judge(),
              create_accuracy_judge(),
          ]

      # Add category-specific judge if available
      category_judges = create_category_specific_judges()
      if category in category_judges:
          if include_tools:
              judges.append(ToolAwareLLMJudge(base_judge=category_judges[category]))
          else:
              judges.append(category_judges[category])

      return judges
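
A sketch of how this would plug into case construction in lib/dataset_builder.py (build_case and the record field names are hypothetical stand-ins for however you currently build cases):

  from pydantic_evals import Case

  def build_case(record: dict) -> Case:
      """Build one evaluation case with category-appropriate judges attached."""
      return Case(
          name=record['id'],  # hypothetical record fields
          inputs=record['client_message'],
          metadata={'category': record['category']},
          evaluators=tuple(get_judges_for_case(record['category'])),
      )
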
  4. Add Tool Call Validation Evaluators
  # In lib/judges.py, add specific tool call evaluators:

  from pydantic_evals.evaluators.common import HasMatchingSpan

  def create_tool_call_validators(expected_tools: list[str]) -> list[Evaluator]:
      """Create evaluators that check for specific tool calls."""
      validators = []

      for tool_name in expected_tools:
          validators.append(
              HasMatchingSpan(
                  query=SpanQuery(
                      name_equals='running tool',
                      has_attributes={'gen_ai.tool.name': tool_name}
                  ),
                  evaluation_name=f"called_{tool_name}"
              )
          )

      return validators

  # Then, in lib/dataset_builder.py, extend get_judges_for_case for categories that should call specific tools:
  def get_judges_for_case(category: str) -> list[Evaluator]:
      judges = [create_tool_aware_primary_judge()]

      # Add tool validators for categories that should use specific tools
      if category == "direct_tour_requests":
          judges.extend(create_tool_call_validators([
              'schedule_tour',
              'check_availability',
              'get_property_details'
          ]))

      return judges
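
Each HasMatchingSpan validator then shows up in the report as its own pass/fail assertion (named called_<tool_name>, per the evaluation_name above), alongside the LLM judge scores. A hedged end-to-end sketch, where the case input is made up and replay_message_function is your existing task function:

  from pydantic_evals import Case, Dataset

  dataset = Dataset(
      cases=[
          Case(
              name='tour_request_basic',
              inputs='Can I tour the condo at 123 Main St this Saturday?',
              metadata={'category': 'direct_tour_requests'},
              evaluators=tuple(get_judges_for_case('direct_tour_requests')),
          ),
      ],
  )

  report = dataset.evaluate_sync(replay_message_function)
  report.print(include_input=True, include_output=True)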

Benefits of This Approach

  1. No changes needed to core evaluation flow - Tool calls are already captured
  2. LLM judges get full context - Both response text and tool usage
  3. Flexible validation - Can check for specific tools or just evaluate overall usage
  4. Backward compatible - Can still run without tool information if needed

Summary

The key insight is that tool calls are already being captured in the span tree during evaluation. You just need to:

  1. Extract them in a custom evaluator wrapper
  2. Format them for the LLM judge
  3. Update rubrics to consider tool usage

This gives your LLM judges full visibility into both what the agent said AND what tools it used, enabling much more comprehensive evaluation of agent behavior.
