
Framework-level testing utilities for unit testing agents #3138

@amnox

Description

PydanticAI has excellent testing infrastructure (TestModel, FunctionModel, capture_run_messages()), but lacks framework-level utilities that make unit testing agents ergonomic and maintainable.

This proposal suggests adding testing helpers that complement existing tools and reduce boilerplate in agent unit tests.


Problem

Currently, testing an agent requires manual inspection of captured messages:

from pydantic_ai import Agent, capture_run_messages
from pydantic_ai.messages import TextPart
from pydantic_ai.models.test import TestModel

agent = Agent('openai:gpt-4o', ...)

def test_weather_agent():
    with capture_run_messages() as messages:
        with agent.override(model=TestModel()):
            result = agent.run_sync('Weather in London?')

    # Manual inspection - verbose and brittle
    assert len(messages) == 4
    assert messages[1].parts[0].tool_name == 'get_forecast'
    assert messages[1].parts[0].args['location'] == 'a'  # TestModel generates 'a'
    assert isinstance(messages[3].parts[0], TextPart)

Pain points:

  1. Verbose: Every test has to manually index into the messages list
  2. Brittle: If the message order or count changes, tests break
  3. Hard to read: Test intent is buried in low-level assertions
  4. No semantic assertions: Can't easily assert "tool X was called with args Y"

Proposed Solution

Add a pydantic_testing module (or extend existing test utilities) with:

1. Assertion Helpers

from pydantic_ai.testing import AgentAsserter

def test_weather_agent():
    asserter = AgentAsserter(agent)

    with asserter.override(model=TestModel()):
        result = asserter.run_sync('Weather in London?')

    # Framework-level assertions
    asserter.expect_tool_called('get_forecast', location='London')
    asserter.expect_no_errors()
    asserter.expect_output_contains('weather')
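
This could be layered directly on top of capture_run_messages() and the existing message types, so no new model plumbing is needed. A minimal sketch, assuming the current ToolCallPart.args_as_dict() helper and result.output; the class and method names are the proposed (hypothetical) API, not something that exists today:

from pydantic_ai import capture_run_messages
from pydantic_ai.messages import ModelResponse, ToolCallPart


class AgentAsserter:
    """Sketch: run the agent, record its messages, expose semantic assertions."""

    def __init__(self, agent):
        self._agent = agent
        self._messages = []
        self._result = None

    def override(self, **kwargs):
        # Reuse the agent's existing override() context manager (e.g. model=TestModel()).
        return self._agent.override(**kwargs)

    def run_sync(self, prompt, **kwargs):
        with capture_run_messages() as messages:
            self._result = self._agent.run_sync(prompt, **kwargs)
        self._messages = list(messages)
        return self._result

    def _tool_calls(self):
        # Every ToolCallPart from every model response, in order.
        return [
            part
            for message in self._messages
            if isinstance(message, ModelResponse)
            for part in message.parts
            if isinstance(part, ToolCallPart)
        ]

    def expect_tool_called(self, name, **expected_args):
        for call in self._tool_calls():
            if call.tool_name == name:
                args = call.args_as_dict()
                if all(args.get(key) == value for key, value in expected_args.items()):
                    return
        raise AssertionError(f'expected a call to {name!r} with args {expected_args!r}')

    def expect_output_contains(self, text):
        assert text in str(self._result.output), f'{text!r} not found in agent output'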

2. Fixture Support with Real Data

Currently TestModel generates schema-valid but meaningless data (location='a', date='2024-01-01').

Proposal: Support YAML fixtures with realistic data:

# fixtures/weather_scenario.yaml
model_responses:
  - on_message: "Weather in London?"
    call_tool:
      name: get_forecast
      args:
        location: London  # Extracted from prompt
    return: "Sunny, 22°C"

  - on_tool_return: "Sunny, 22°C"
    respond: "The weather in London is sunny, 22°C."
from pydantic_ai.testing import with_fixture

@with_fixture('fixtures/weather_scenario.yaml')
def test_weather_happy_path():
    result = agent.run_sync('Weather in London?')
    assert result.output == "The weather in London is sunny, 22°C."

Under the hood, the decorator loads the fixture and builds a FunctionModel automatically.
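
A rough sketch of that loader, assuming PyYAML and the existing FunctionModel API; fixture_model and the matching logic are illustrative only (the on_message / on_tool_return keys aren't actually matched here, and stubbing the tool's own return value is left out):

import yaml  # assumes PyYAML is available

from pydantic_ai.messages import (
    ModelMessage,
    ModelResponse,
    TextPart,
    ToolCallPart,
    ToolReturnPart,
)
from pydantic_ai.models.function import AgentInfo, FunctionModel


def fixture_model(path):
    """Hypothetical: turn a YAML fixture like the one above into a FunctionModel."""
    with open(path) as f:
        steps = yaml.safe_load(f)['model_responses']

    def model_fn(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        # If the latest request carries a tool return, reply with the scripted text;
        # otherwise issue the scripted tool call.
        latest = messages[-1].parts
        has_tool_return = any(isinstance(p, ToolReturnPart) for p in latest)
        for step in steps:
            if has_tool_return and 'respond' in step:
                return ModelResponse(parts=[TextPart(content=step['respond'])])
            if not has_tool_return and 'call_tool' in step:
                call = step['call_tool']
                return ModelResponse(
                    parts=[ToolCallPart(tool_name=call['name'], args=call['args'])]
                )
        raise AssertionError('fixture has no step for this request')

    return FunctionModel(model_fn)

with_fixture would then be little more than a decorator that wraps the test in agent.override(model=fixture_model(path)).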

3. Flow Snapshot Testing

For regression testing complex agent flows:

from pydantic_ai.testing import snapshot_flow

@snapshot_flow('snapshots/refund_flow.json')
def test_refund_flow():
    result = agent.run_sync('refund order #123')
    # Compares against snapshot:
    # - Tools called
    # - Arguments
    # - Final output

The first run creates the snapshot; subsequent runs compare against it.
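
A first pass could serialize just the tool calls (name plus args) and the model's text output, then diff against the stored JSON. A hedged sketch; the decorator name, snapshot format, and args_as_dict() usage are assumptions rather than settled design:

import functools
import json
from pathlib import Path

from pydantic_ai import capture_run_messages
from pydantic_ai.messages import ModelResponse, TextPart, ToolCallPart


def snapshot_flow(path):
    """Hypothetical decorator: record the flow on the first run, compare on later runs."""

    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            with capture_run_messages() as messages:
                test_fn(*args, **kwargs)

            # Reduce the captured run to a comparable summary, in order.
            summary = []
            for message in messages:
                if not isinstance(message, ModelResponse):
                    continue
                for part in message.parts:
                    if isinstance(part, ToolCallPart):
                        summary.append({'tool': part.tool_name, 'args': part.args_as_dict()})
                    elif isinstance(part, TextPart):
                        summary.append({'text': part.content})

            snapshot = Path(path)
            if snapshot.exists():
                assert summary == json.loads(snapshot.read_text()), 'agent flow changed'
            else:
                snapshot.parent.mkdir(parents=True, exist_ok=True)
                snapshot.write_text(json.dumps(summary, indent=2, ensure_ascii=False))

        return wrapper

    return decorator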


Benefits

  1. Developer Experience: Less boilerplate, clearer test intent
  2. Maintainability: Tests are less brittle to internal changes
  3. Onboarding: Easier for new users to write good tests
  4. Best Practices: Framework encourages testing patterns
  5. Complements Existing Tools: Builds on TestModel/FunctionModel, doesn't replace them

Implementation Options

Option A: Minimal (just assertions)

Add pydantic_ai.testing.AgentAsserter with basic assertions:

  • expect_tool_called(name, **kwargs)
  • expect_tools_called_in_order([...]) (see the sketch below)
  • expect_no_errors()
  • expect_output_contains(text)

Pros: Small scope, easy to maintain
Cons: Doesn't address fixture/snapshot problem
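
For reference, expect_tools_called_in_order only needs an order-preserving subsequence check over the recorded tool-call names; a standalone sketch (the function name is illustrative):

def assert_tools_called_in_order(called, expected):
    # Order-preserving subsequence check: other tool calls may interleave
    # between the expected ones without failing the assertion.
    it = iter(called)
    missing = [name for name in expected if name not in it]
    if missing:
        raise AssertionError(
            f'expected tools in order {expected!r}, got {called!r} (missing {missing!r})'
        )


assert_tools_called_in_order(['search', 'fetch_page', 'summarize'], ['search', 'summarize'])  # passes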

Option B: Full Featured

Add all three components:

  • Assertion helpers
  • Fixture loader (YAML → FunctionModel)
  • Snapshot testing

Pros: Complete solution
Cons: Larger scope, more maintenance

Option C: Plugin Ecosystem

Provide hooks/extension points so the community can build testing tools as separate packages.

Pros: Minimal framework changes, community-driven
Cons: Fragmentation, less discoverability


Comparison to Existing Tools

Feature | Current (TestModel + capture_run_messages) | Proposed
Mock LLM responses | ✅ Schema-based generation | ✅ Fixture-based real data
Assert tool calls | ❌ Manual message inspection | ✅ expect_tool_called()
Snapshot testing | ❌ Not available | ✅ Flow snapshots
Test fixtures | ❌ Manual setup | ✅ YAML fixtures
Boilerplate | ❌ High (manual indexing) | ✅ Low (declarative assertions)

Related Work

Other frameworks with similar testing utilities:

  • Pytest: pytest.mark.parametrize, fixtures, approx()
  • FastAPI: TestClient with ergonomic assertions
  • Django: self.assertContains(), self.assertRedirects()

These frameworks show the value of framework-level testing helpers.


Open Questions

  1. Naming: pydantic_ai.testing vs pydantic_ai_testing (separate package)?
  2. Scope: Start minimal (assertions only) or full-featured?
  3. API Design: Should this integrate with existing pytest fixtures or be standalone?
  4. Backward Compatibility: How to avoid breaking existing tests?

Alternatives Considered

Alternative 1: "Just use existing tools"

Argument: capture_run_messages() + TestModel is sufficient

Counter: While technically sufficient, developer experience matters. Framework-level utilities reduce friction and encourage best practices.

Alternative 2: External package

Argument: Build this as pytest-pydantic-ai plugin

Counter: Possible, but having official testing utilities ensures consistency and discoverability.


Migration Path

If accepted, this could be rolled out gradually:

  1. Phase 1: Add assertion helpers (AgentAsserter)
  2. Phase 2: Add fixture support (YAML → FunctionModel)
  3. Phase 3: Add snapshot testing
  4. Phase 4: Update docs with testing best practices guide

Existing tests continue to work; new tests can opt into the helpers.


Example: Before/After

Before (Current)

from pydantic_ai import capture_run_messages
from pydantic_ai.messages import ModelResponse, ToolCallPart
from pydantic_ai.models.test import TestModel

def test_multi_tool_flow():
    with capture_run_messages() as messages:
        with agent.override(model=TestModel()):
            result = agent.run_sync('Search and summarize AI agents')

    # Find tool calls in messages
    tool_calls = [
        p for m in messages
        if isinstance(m, ModelResponse)
        for p in m.parts
        if isinstance(p, ToolCallPart)
    ]

    assert len(tool_calls) == 2
    assert tool_calls[0].tool_name == 'search'
    assert tool_calls[1].tool_name == 'summarize'
    assert 'search' in tool_calls[1].args.get('context', '')

After (Proposed)

from pydantic_ai.testing import AgentAsserter, with_fixture

@with_fixture('fixtures/search_and_summarize.yaml')
def test_multi_tool_flow():
    asserter = AgentAsserter(agent)
    result = asserter.run_sync('Search and summarize AI agents')

    asserter.expect_tools_called_in_order(['search', 'summarize'])
    asserter.expect_tool_called('summarize', context_contains='search')

Impact

Who benefits:

  • New users writing their first agent tests
  • Teams maintaining large test suites
  • Anyone frustrated with capture_run_messages boilerplate

Estimated effort:

  • Phase 1 (assertions): ~1-2 weeks
  • Phase 2 (fixtures): ~2-3 weeks
  • Phase 3 (snapshots): ~1-2 weeks
  • Documentation: ~1 week

Request for Feedback

I'm happy to contribute this if there's interest. Would love feedback on:

  1. Is this a problem worth solving?
  2. Which option (A/B/C) aligns with PydanticAI's vision?
  3. API design preferences?
  4. Any concerns about scope/complexity?

I can start with a minimal PR (just assertion helpers) if that's easier to review.


Note: I'm not affiliated with PydanticAI - just a user who's been exploring testing patterns and wanted to share feedback. Happy to contribute if this resonates with the team's priorities.
