Evaluate AI system behavior by analyzing OpenTelemetry spans captured during execution.
!!! note "Requires Logfire"
Span-based evaluation requires logfire to be installed and configured:
bash pip install 'pydantic-evals[logfire]'
Span-based evaluation enables you to evaluate how your AI system executes, not just what it produces. This is essential for complex agents where ensuring the desired behavior depends on the execution path taken, not just the final output.
Traditional evaluators assess task inputs and outputs. For simple tasks, this may be sufficient—if the output is correct, the task succeeded. But for complex multi-step agents, the process matters as much as the result:
- A correct answer reached incorrectly - An agent might produce the right output by accident (e.g., guessing, using cached data when it should have searched, calling the wrong tools but getting lucky)
- Verification of required behaviors - You need to ensure specific tools were called, certain code paths executed, or particular patterns followed
- Performance and efficiency - The agent should reach the answer efficiently, without unnecessary tool calls, infinite loops, or excessive retries
- Safety and compliance - Critical to verify that dangerous operations weren't attempted, sensitive data wasn't accessed inappropriately, or guardrails weren't bypassed
Span-based evaluation is particularly valuable for:
- RAG systems - Verify documents were retrieved and reranked before generation, not just that the answer included citations
- Multi-agent coordination - Ensure the orchestrator delegated to the right specialist agents in the correct order
- Tool-calling agents - Confirm specific tools were used (or avoided), and in the expected sequence
- Debugging and regression testing - Catch behavioral regressions where outputs remain correct but the internal logic deteriorates
- Production alignment - Ensure your evaluation assertions operate on the same telemetry data captured in production, so eval insights directly translate to production monitoring
When you configure logfire (logfire.configure()), Pydantic Evals captures all OpenTelemetry spans generated during task execution. You can then write evaluators that assert conditions on:
- Which tools were called -
HasMatchingSpan(query={'name_contains': 'search_tool'}) - Code paths executed - Verify specific functions ran or particular branches taken
- Timing characteristics - Check that operations complete within SLA bounds
- Error conditions - Detect retries, fallbacks, or specific failure modes
- Execution structure - Verify parent-child relationships, delegation patterns, or execution order
This creates a fundamentally different evaluation paradigm: you're testing behavioral contracts, not just input-output relationships.
import logfire
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan
# Configure logfire to capture spans
logfire.configure(send_to_logfire='if-token-present')
dataset = Dataset(
cases=[Case(inputs='test')],
evaluators=[
# Check that database was queried
HasMatchingSpan(
query={'name_contains': 'database_query'},
evaluation_name='used_database',
),
],
)The [HasMatchingSpan][pydantic_evals.evaluators.HasMatchingSpan] evaluator checks if any span matches a query:
from pydantic_evals.evaluators import HasMatchingSpan
HasMatchingSpan(
query={'name_contains': 'test'},
evaluation_name='span_check',
)Returns: bool - True if any span matches the query
A [SpanQuery][pydantic_evals.otel.SpanQuery] is a dictionary with query conditions:
Match spans by name:
# Exact name match
{'name_equals': 'search_database'}
# Contains substring
{'name_contains': 'tool_call'}
# Regex pattern
{'name_matches_regex': r'llm_call_\d+'}Match spans with specific attributes:
# Has specific attribute values
{'has_attributes': {'operation': 'search', 'status': 'success'}}
# Has attribute keys (any value)
{'has_attribute_keys': ['user_id', 'request_id']}Match based on execution time:
from datetime import timedelta
# Minimum duration
{'min_duration': 1.0} # seconds
{'min_duration': timedelta(seconds=1)}
# Maximum duration
{'max_duration': 5.0} # seconds
{'max_duration': timedelta(seconds=5)}
# Range
{'min_duration': 0.5, 'max_duration': 2.0}Combine conditions:
# NOT
{'not_': {'name_contains': 'error'}}
# AND (all must match)
{'and_': [
{'name_contains': 'tool'},
{'max_duration': 1.0},
]}
# OR (any must match)
{'or_': [
{'name_equals': 'search'},
{'name_equals': 'query'},
]}Query relationships between spans:
# Count direct children
{'min_child_count': 1}
{'max_child_count': 5}
# Some child matches query
{'some_child_has': {'name_contains': 'retry'}}
# All children match query
{'all_children_have': {'max_duration': 0.5}}
# No children match query
{'no_child_has': {'has_attributes': {'error': True}}}
# Descendant queries (recursive)
{'min_descendant_count': 5}
{'some_descendant_has': {'name_contains': 'api_call'}}Query span hierarchy:
# Depth (root spans have depth 0)
{'min_depth': 1} # Not a root span
{'max_depth': 2} # At most 2 levels deep
# Ancestor queries
{'some_ancestor_has': {'name_equals': 'agent_run'}}
{'all_ancestors_have': {'max_duration': 10.0}}
{'no_ancestor_has': {'has_attributes': {'error': True}}}Control recursive queries:
{
'some_descendant_has': {'name_contains': 'expensive'},
'stop_recursing_when': {'name_equals': 'boundary'},
}
# Only search descendants until hitting a span named 'boundary'Check that specific tools were called:
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import HasMatchingSpan
dataset = Dataset(
cases=[Case(inputs='test')],
evaluators=[
# Must call search tool
HasMatchingSpan(
query={'name_contains': 'search_tool'},
evaluation_name='used_search',
),
# Must NOT call dangerous tool
HasMatchingSpan(
query={'not_': {'name_contains': 'delete_database'}},
evaluation_name='safe_execution',
),
],
)Verify a sequence of operations:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
HasMatchingSpan(
query={'name_contains': 'retrieve_context'},
evaluation_name='retrieved_context',
),
HasMatchingSpan(
query={'name_contains': 'generate_response'},
evaluation_name='generated_response',
),
HasMatchingSpan(
query={'and_': [
{'name_contains': 'cite'},
{'has_attribute_keys': ['source_id']},
]},
evaluation_name='added_citations',
),
]Ensure operations meet latency requirements:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Database queries should be fast
HasMatchingSpan(
query={'and_': [
{'name_contains': 'database'},
{'max_duration': 0.1}, # 100ms max
]},
evaluation_name='fast_db_queries',
),
# Overall should complete quickly
HasMatchingSpan(
query={'and_': [
{'name_equals': 'task_execution'},
{'max_duration': 2.0},
]},
evaluation_name='within_sla',
),
]Check for error conditions:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# No errors occurred
HasMatchingSpan(
query={'not_': {'has_attributes': {'error': True}}},
evaluation_name='no_errors',
),
# Retries happened
HasMatchingSpan(
query={'name_contains': 'retry'},
evaluation_name='had_retries',
),
# Fallback was used
HasMatchingSpan(
query={'name_contains': 'fallback_model'},
evaluation_name='used_fallback',
),
]Verify sophisticated behavior patterns:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Agent delegated to sub-agent
HasMatchingSpan(
query={'and_': [
{'name_contains': 'agent'},
{'some_child_has': {'name_contains': 'delegate'}},
]},
evaluation_name='used_delegation',
),
# Made multiple LLM calls with retries
HasMatchingSpan(
query={'and_': [
{'name_contains': 'llm_call'},
{'some_descendant_has': {'name_contains': 'retry'}},
{'min_descendant_count': 3},
]},
evaluation_name='retry_pattern',
),
]For more complex span analysis, write custom evaluators:
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
@dataclass
class CustomSpanCheck(Evaluator):
def evaluate(self, ctx: EvaluatorContext) -> dict[str, bool | int]:
span_tree = ctx.span_tree
# Find specific spans
llm_spans = span_tree.find(lambda node: 'llm' in node.name)
tool_spans = span_tree.find(lambda node: 'tool' in node.name)
# Calculate metrics
total_llm_time = sum(
span.duration.total_seconds() for span in llm_spans
)
return {
'used_llm': len(llm_spans) > 0,
'used_tools': len(tool_spans) > 0,
'tool_count': len(tool_spans),
'llm_fast': total_llm_time < 2.0,
}The [SpanTree][pydantic_evals.otel.SpanTree] provides methods for span analysis:
from pydantic_evals.otel import SpanTree
# Example API (requires span_tree from context)
def example_api(span_tree: SpanTree) -> None:
span_tree.find(lambda n: True) # Find all matching nodes
span_tree.any({'name_contains': 'test'}) # Check if any span matches
span_tree.all({'name_contains': 'test'}) # Check if all spans match
span_tree.count({'name_contains': 'test'}) # Count matching spans
# Iteration
for node in span_tree:
print(node.name, node.duration, node.attributes)Each [SpanNode][pydantic_evals.otel.SpanNode] has:
from pydantic_evals.otel import SpanNode
# Example properties (requires node from context)
def example_properties(node: SpanNode) -> None:
_ = node.name # Span name
_ = node.duration # timedelta
_ = node.attributes # dict[str, AttributeValue]
_ = node.start_timestamp # datetime
_ = node.end_timestamp # datetime
_ = node.children # list[SpanNode]
_ = node.descendants # list[SpanNode] (recursive)
_ = node.ancestors # list[SpanNode]
_ = node.parent # SpanNode | NoneIf you're sending data to Logfire, you can view all spans in the web UI to understand the trace structure.
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
@dataclass
class DebugSpans(Evaluator):
def evaluate(self, ctx: EvaluatorContext) -> bool:
for node in ctx.span_tree:
print(f"{' ' * len(node.ancestors)}{node.name} ({node.duration})")
return TrueTest queries incrementally:
from pydantic_evals.evaluators import HasMatchingSpan
# Start simple
query = {'name_contains': 'tool'}
# Add conditions gradually
query = {'and_': [
{'name_contains': 'tool'},
{'max_duration': 1.0},
]}
# Test in evaluator
HasMatchingSpan(query=query, evaluation_name='test')Verify retrieval-augmented generation workflow:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Retrieved documents
HasMatchingSpan(
query={'name_contains': 'vector_search'},
evaluation_name='retrieved_docs',
),
# Reranked results
HasMatchingSpan(
query={'name_contains': 'rerank'},
evaluation_name='reranked_results',
),
# Generated with context
HasMatchingSpan(
query={'and_': [
{'name_contains': 'generate'},
{'has_attribute_keys': ['context_ids']},
]},
evaluation_name='used_context',
),
]Verify agent coordination:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Master agent ran
HasMatchingSpan(
query={'name_equals': 'master_agent'},
evaluation_name='master_ran',
),
# Delegated to specialist
HasMatchingSpan(
query={'and_': [
{'name_contains': 'specialist_agent'},
{'some_ancestor_has': {'name_equals': 'master_agent'}},
]},
evaluation_name='delegated_correctly',
),
# No circular delegation
HasMatchingSpan(
query={'not_': {'and_': [
{'name_contains': 'agent'},
{'some_descendant_has': {'name_contains': 'agent'}},
{'some_ancestor_has': {'name_contains': 'agent'}},
]}},
evaluation_name='no_circular_delegation',
),
]Verify intelligent tool selection:
from pydantic_evals.evaluators import HasMatchingSpan
evaluators = [
# Used search before answering
HasMatchingSpan(
query={'and_': [
{'name_contains': 'search'},
{'some_ancestor_has': {'name_contains': 'answer'}},
]},
evaluation_name='searched_before_answering',
),
# Limited tool calls (no loops)
HasMatchingSpan(
query={'and_': [
{'name_contains': 'tool'},
{'max_child_count': 5},
]},
evaluation_name='reasonable_tool_usage',
),
]- Start Simple: Begin with basic name queries, add complexity as needed
- Use Descriptive Names: Name your spans well in your application code
- Test Queries: Verify queries work before running full evaluations
- Combine with Other Evaluators: Use span checks alongside output validation
- Document Expectations: Comment why specific spans should/shouldn't exist
- Logfire Integration - Set up Logfire for span capture
- Custom Evaluators - Write advanced span analysis
- Built-in Evaluators - Other evaluator types