
Framework-level testing utilities for unit testing agents #3138

@amnox

Description

PydanticAI has excellent testing infrastructure (TestModel, FunctionModel, capture_run_messages()), but lacks framework-level utilities that make unit testing agents ergonomic and maintainable.

This proposal suggests adding testing helpers that complement existing tools and reduce boilerplate in agent unit tests.


Problem

Currently, testing an agent requires manual inspection of captured messages:

from pydantic_ai import Agent, capture_run_messages
from pydantic_ai.messages import TextPart
from pydantic_ai.models.test import TestModel

agent = Agent('openai:gpt-4o', ...)

def test_weather_agent():
    with capture_run_messages() as messages:
        with agent.override(model=TestModel()):
            result = agent.run_sync('Weather in London?')

    # Manual inspection - verbose and brittle
    assert len(messages) == 4
    assert messages[1].parts[0].tool_name == 'get_forecast'
    assert messages[1].parts[0].args['location'] == 'a'  # TestModel generates 'a'
    assert isinstance(messages[3].parts[0], TextPart)

Pain points:

  1. Verbose: Every test has to manually index into the messages list
  2. Brittle: If the message order or count changes, tests break
  3. Hard to read: Test intent is buried in low-level assertions
  4. No semantic assertions: Can't easily assert "tool X was called with args Y"

Proposed Solution

Add a pydantic_testing module (or extend existing test utilities) with:

1. Assertion Helpers

from pydantic_ai.testing import AgentAsserter

def test_weather_agent():
    asserter = AgentAsserter(agent)

    with asserter.override(model=TestModel()):
        result = asserter.run_sync('Weather in London?')

    # Framework-level assertions
    asserter.expect_tool_called('get_forecast', location='London')
    asserter.expect_no_errors()
    asserter.expect_output_contains('weather')
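
This could be layered directly on top of capture_run_messages() and the existing message types, so no new model plumbing is needed. A minimal sketch, assuming the current ToolCallPart.args_as_dict() helper and result.output; the class and method names are the proposed (hypothetical) API, not something that exists today:

from pydantic_ai import capture_run_messages
from pydantic_ai.messages import ModelResponse, ToolCallPart


class AgentAsserter:
    """Sketch: run the agent, record its messages, expose semantic assertions."""

    def __init__(self, agent):
        self._agent = agent
        self._messages = []
        self._result = None

    def override(self, **kwargs):
        # Reuse the agent's existing override() context manager (e.g. model=TestModel()).
        return self._agent.override(**kwargs)

    def run_sync(self, prompt, **kwargs):
        with capture_run_messages() as messages:
            self._result = self._agent.run_sync(prompt, **kwargs)
        self._messages = list(messages)
        return self._result

    def _tool_calls(self):
        # Every ToolCallPart from every model response, in order.
        return [
            part
            for message in self._messages
            if isinstance(message, ModelResponse)
            for part in message.parts
            if isinstance(part, ToolCallPart)
        ]

    def expect_tool_called(self, name, **expected_args):
        for call in self._tool_calls():
            if call.tool_name == name:
                args = call.args_as_dict()
                if all(args.get(key) == value for key, value in expected_args.items()):
                    return
        raise AssertionError(f'expected a call to {name!r} with args {expected_args!r}')

    def expect_output_contains(self, text):
        assert text in str(self._result.output), f'{text!r} not found in agent output'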

2. Fixture Support with Real Data

Currently TestModel generates schema-valid but meaningless data (location='a', date='2024-01-01').

Proposal: Support YAML fixtures with realistic data:

# fixtures/weather_scenario.yaml
model_responses:
  - on_message: "Weather in London?"
    call_tool:
      name: get_forecast
      args:
        location: London  # Extracted from prompt
    return: "Sunny, 22°C"

  - on_tool_return: "Sunny, 22°C"
    respond: "The weather in London is sunny, 22°C."
from pydantic_ai.testing import with_fixture

@with_fixture('fixtures/weather_scenario.yaml')
def test_weather_happy_path():
    result = agent.run_sync('Weather in London?')
    assert result.output == "The weather in London is sunny, 22°C."

Under the hood, the decorator loads the fixture and builds a FunctionModel automatically.
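
A rough sketch of that loader, assuming PyYAML and the existing FunctionModel API; fixture_model and the matching logic are illustrative only (the on_message / on_tool_return keys aren't actually matched here, and stubbing the tool's own return value is left out):

import yaml  # assumes PyYAML is available

from pydantic_ai.messages import (
    ModelMessage,
    ModelResponse,
    TextPart,
    ToolCallPart,
    ToolReturnPart,
)
from pydantic_ai.models.function import AgentInfo, FunctionModel


def fixture_model(path):
    """Hypothetical: turn a YAML fixture like the one above into a FunctionModel."""
    with open(path) as f:
        steps = yaml.safe_load(f)['model_responses']

    def model_fn(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        # If the latest request carries a tool return, reply with the scripted text;
        # otherwise issue the scripted tool call.
        latest = messages[-1].parts
        has_tool_return = any(isinstance(p, ToolReturnPart) for p in latest)
        for step in steps:
            if has_tool_return and 'respond' in step:
                return ModelResponse(parts=[TextPart(content=step['respond'])])
            if not has_tool_return and 'call_tool' in step:
                call = step['call_tool']
                return ModelResponse(
                    parts=[ToolCallPart(tool_name=call['name'], args=call['args'])]
                )
        raise AssertionError('fixture has no step for this request')

    return FunctionModel(model_fn)

with_fixture would then be little more than a decorator that wraps the test in agent.override(model=fixture_model(path)).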

3. Flow Snapshot Testing

For regression testing complex agent flows:

from pydantic_ai.testing import snapshot_flow

@snapshot_flow('snapshots/refund_flow.json')
def test_refund_flow():
    result = agent.run_sync('refund order #123')
    # Compares against snapshot:
    # - Tools called
    # - Arguments
    # - Final output

The first run creates the snapshot; subsequent runs compare against it.
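
A first pass could serialize just the tool calls (name plus args) and the model's text output, then diff against the stored JSON. A hedged sketch; the decorator name, snapshot format, and args_as_dict() usage are assumptions rather than settled design:

import functools
import json
from pathlib import Path

from pydantic_ai import capture_run_messages
from pydantic_ai.messages import ModelResponse, TextPart, ToolCallPart


def snapshot_flow(path):
    """Hypothetical decorator: record the flow on the first run, compare on later runs."""

    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            with capture_run_messages() as messages:
                test_fn(*args, **kwargs)

            # Reduce the captured run to a comparable summary, in order.
            summary = []
            for message in messages:
                if not isinstance(message, ModelResponse):
                    continue
                for part in message.parts:
                    if isinstance(part, ToolCallPart):
                        summary.append({'tool': part.tool_name, 'args': part.args_as_dict()})
                    elif isinstance(part, TextPart):
                        summary.append({'text': part.content})

            snapshot = Path(path)
            if snapshot.exists():
                assert summary == json.loads(snapshot.read_text()), 'agent flow changed'
            else:
                snapshot.parent.mkdir(parents=True, exist_ok=True)
                snapshot.write_text(json.dumps(summary, indent=2, ensure_ascii=False))

        return wrapper

    return decorator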


Benefits

  1. Developer Experience: Less boilerplate, clearer test intent
  2. Maintainability: Tests are less brittle to internal changes
  3. Onboarding: Easier for new users to write good tests
  4. Best Practices: Framework encourages testing patterns
  5. Complements Existing Tools: Builds on TestModel/FunctionModel, doesn't replace them

Implementation Options

Option A: Minimal (just assertions)

Add pydantic_ai.testing.AgentAsserter with basic assertions:

  • expect_tool_called(name, **kwargs)
  • expect_tools_called_in_order([...]) (see the sketch below)
  • expect_no_errors()
  • expect_output_contains(text)

Pros: Small scope, easy to maintain
Cons: Doesn't address fixture/snapshot problem
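
For reference, expect_tools_called_in_order only needs an order-preserving subsequence check over the recorded tool-call names; a standalone sketch (the function name is illustrative):

def assert_tools_called_in_order(called, expected):
    # Order-preserving subsequence check: other tool calls may interleave
    # between the expected ones without failing the assertion.
    it = iter(called)
    missing = [name for name in expected if name not in it]
    if missing:
        raise AssertionError(
            f'expected tools in order {expected!r}, got {called!r} (missing {missing!r})'
        )


assert_tools_called_in_order(['search', 'fetch_page', 'summarize'], ['search', 'summarize'])  # passes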

Option B: Full Featured

Add all three components:

  • Assertion helpers
  • Fixture loader (YAML → FunctionModel)
  • Snapshot testing

Pros: Complete solution
Cons: Larger scope, more maintenance

Option C: Plugin Ecosystem

Provide hooks/extension points so the community can build testing tools as separate packages.

Pros: Minimal framework changes, community-driven
Cons: Fragmentation, less discoverability


Comparison to Existing Tools

Feature | Current (TestModel + capture_run_messages) | Proposed
Mock LLM responses | ✅ Schema-based generation | ✅ Fixture-based real data
Assert tool calls | ❌ Manual message inspection | ✅ expect_tool_called()
Snapshot testing | ❌ Not available | ✅ Flow snapshots
Test fixtures | ❌ Manual setup | ✅ YAML fixtures
Boilerplate | ❌ High (manual indexing) | ✅ Low (declarative assertions)

Related Work

Other frameworks with similar testing utilities:

  • Pytest: pytest.mark.parametrize, fixtures, approx()
  • FastAPI: TestClient with ergonomic assertions
  • Django: self.assertContains(), self.assertRedirects()

These frameworks show the value of framework-level testing helpers.


Open Questions

  1. Naming: pydantic_ai.testing vs pydantic_ai_testing (separate package)?
  2. Scope: Start minimal (assertions only) or full-featured?
  3. API Design: Should this integrate with existing pytest fixtures or be standalone?
  4. Backward Compatibility: How to avoid breaking existing tests?

Alternatives Considered

Alternative 1: "Just use existing tools"

Argument: capture_run_messages() + TestModel is sufficient

Counter: While technically sufficient, developer experience matters. Framework-level utilities reduce friction and encourage best practices.

Alternative 2: External package

Argument: Build this as pytest-pydantic-ai plugin

Counter: Possible, but having official testing utilities ensures consistency and discoverability.


Migration Path

If accepted, this could be rolled out gradually:

  1. Phase 1: Add assertion helpers (AgentAsserter)
  2. Phase 2: Add fixture support (YAML → FunctionModel)
  3. Phase 3: Add snapshot testing
  4. Phase 4: Update docs with testing best practices guide

Existing tests continue to work; new tests can opt into the helpers.


Example: Before/After

Before (Current)

from pydantic_ai import capture_run_messages
from pydantic_ai.messages import ModelResponse, ToolCallPart
from pydantic_ai.models.test import TestModel

def test_multi_tool_flow():
    with capture_run_messages() as messages:
        with agent.override(model=TestModel()):
            result = agent.run_sync('Search and summarize AI agents')

    # Find tool calls in messages
    tool_calls = [
        p for m in messages
        if isinstance(m, ModelResponse)
        for p in m.parts
        if isinstance(p, ToolCallPart)
    ]

    assert len(tool_calls) == 2
    assert tool_calls[0].tool_name == 'search'
    assert tool_calls[1].tool_name == 'summarize'
    assert 'search' in tool_calls[1].args.get('context', '')

After (Proposed)

from pydantic_ai.testing import AgentAsserter, with_fixture

@with_fixture('fixtures/search_and_summarize.yaml')
def test_multi_tool_flow():
    asserter = AgentAsserter(agent)
    result = asserter.run_sync('Search and summarize AI agents')

    asserter.expect_tools_called_in_order(['search', 'summarize'])
    asserter.expect_tool_called('summarize', context_contains='search')

Impact

Who benefits:

  • New users writing their first agent tests
  • Teams maintaining large test suites
  • Anyone frustrated with capture_run_messages boilerplate

Estimated effort:

  • Phase 1 (assertions): ~1-2 weeks
  • Phase 2 (fixtures): ~2-3 weeks
  • Phase 3 (snapshots): ~1-2 weeks
  • Documentation: ~1 week

Request for Feedback

I'm happy to contribute this if there's interest. Would love feedback on:

  1. Is this a problem worth solving?
  2. Which option (A/B/C) aligns with PydanticAI's vision?
  3. API design preferences?
  4. Any concerns about scope/complexity?

I can start with a minimal PR (just assertion helpers) if that's easier to review.


Note: I'm not affiliated with PydanticAI - just a user who's been exploring testing patterns and wanted to share feedback. Happy to contribute if this resonates with the team's priorities.
