Commit d28dee8

adding docs for testing and evals (#99)
1 parent 1e16660 commit d28dee8

25 files changed: +454 -140 lines changed

docs/agents.md

Lines changed: 4 additions & 8 deletions
@@ -246,9 +246,7 @@ print(dice_result.all_messages())
     ),
     ModelStructuredResponse(
         calls=[
-            ToolCall(
-                tool_name='roll_die', args=ArgsObject(args_object={}), tool_id=None
-            )
+            ToolCall(tool_name='roll_die', args=ArgsDict(args_dict={}), tool_id=None)
         ],
         timestamp=datetime.datetime(...),
         role='model-structured-response',
@@ -263,9 +261,7 @@ print(dice_result.all_messages())
     ModelStructuredResponse(
         calls=[
             ToolCall(
-                tool_name='get_player_name',
-                args=ArgsObject(args_object={}),
-                tool_id=None,
+                tool_name='get_player_name', args=ArgsDict(args_dict={}), tool_id=None
             )
         ],
         timestamp=datetime.datetime(...),
@@ -485,7 +481,7 @@ except UnexpectedModelBehavior as e:
         calls=[
             ToolCall(
                 tool_name='calc_volume',
-                args=ArgsObject(args_object={'size': 6}),
+                args=ArgsDict(args_dict={'size': 6}),
                 tool_id=None,
             )
         ],
@@ -503,7 +499,7 @@ except UnexpectedModelBehavior as e:
         calls=[
             ToolCall(
                 tool_name='calc_volume',
-                args=ArgsObject(args_object={'size': 6}),
+                args=ArgsDict(args_dict={'size': 6}),
                 tool_id=None,
            )
        ],
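For context on the `ArgsObject` to `ArgsDict` rename shown above, here is a small illustrative snippet (not part of this diff) constructing a `ToolCall` with the new class; the arguments now live on the `args_dict` field:

```py
from pydantic_ai.messages import ArgsDict, ToolCall

# tool arguments are now carried by ArgsDict.args_dict rather than ArgsObject.args_object
call = ToolCall(tool_name='roll_die', args=ArgsDict(args_dict={}), tool_id=None)
print(call.args.args_dict)  # -> {}
```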

docs/api/agent.md

Lines changed: 1 addition & 2 deletions
@@ -8,8 +8,7 @@
         - run_sync
         - run_stream
         - model
-        - override_deps
-        - override_model
+        - override
         - last_run_messages
         - system_prompt
         - tool

docs/api/models/test.md

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 # `pydantic_ai.models.test`
 
+Utility model for quickly testing apps built with PydanticAI.
+
 ::: pydantic_ai.models.test
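To illustrate that one-liner, here is a minimal usage sketch; the agent below is hypothetical, not taken from the docs, and the full worked examples are in the docs/testing-evals.md diff further down:

```py
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

# a hypothetical agent, purely for illustration
agent = Agent('openai:gpt-4o', system_prompt='Reply with one short sentence.')

# swap in TestModel so no real LLM request is made
with agent.override(model=TestModel()):
    result = agent.run_sync('Hello there')
    print(result.data)  # deterministic output generated by TestModel
```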

docs/api/models/vertexai.md

Lines changed: 0 additions & 3 deletions
@@ -10,9 +10,6 @@ and
 function endpoints
 having the same schemas as the equivalent [Gemini endpoints][pydantic_ai.models.gemini.GeminiModel].
 
-There are four advantages of using this API over the `generativelanguage.googleapis.com` API which
-[`GeminiModel`][pydantic_ai.models.gemini.GeminiModel] uses, and one big disadvantage.
-
 ## Setup
 
 For details on how to set up authentication with this model as well as a comparison with the `generativelanguage.googleapis.com` API used by [`GeminiModel`][pydantic_ai.models.gemini.GeminiModel],

docs/dependencies.md

Lines changed: 2 additions & 2 deletions
@@ -231,7 +231,7 @@ When testing agents, it's useful to be able to customise dependencies.
 While this can sometimes be done by calling the agent directly within unit tests, we can also override dependencies
 while calling application code which in turn calls the agent.
 
-This is done via the [`override_deps`][pydantic_ai.Agent.override_deps] method on the agent.
+This is done via the [`override`][pydantic_ai.Agent.override] method on the agent.
 
 ```py title="joke_app.py"
 from dataclasses import dataclass
@@ -286,7 +286,7 @@ class TestMyDeps(MyDeps):  # (1)!
 
 async def test_application_code():
     test_deps = TestMyDeps('test_key', None)  # (2)!
-    with joke_agent.override_deps(test_deps):  # (3)!
+    with joke_agent.override(deps=test_deps):  # (3)!
         joke = await application_code('Tell me a joke.')  # (4)!
     assert joke.startswith('Did you hear about the toothpaste scandal?')
 ```

docs/testing-evals.md

Lines changed: 275 additions & 4 deletions
@@ -1,8 +1,279 @@
 # Testing and Evals
 
-TODO
+With PydanticAI and LLM integrations in general, there are two distinct kinds of test:
 
-principles:
+1. **Unit tests** — tests of your application code, and whether it's behaving correctly
+2. **"Evals"** — tests of the LLM, and how good or bad its responses are
 
-* unit tests are no different to any other app, just `TestModel` or `FunctionModel`, we know how to do unit tests, there's no magic just good practice
-* evals are more like benchmarks, they never "pass" although they do "fail", you care mostly about how they change over time, we (and we think most other people) don't really know what a "good" eval is, we provide some useful tools, we'll improve this if/when a common best practice emerges, or we think we have something interesting to say
+For the most part, these two kinds of tests have pretty separate goals and considerations.
+
+## Unit tests
+
+Unit tests for PydanticAI code are just like unit tests for any other Python code.
+
+Because for the most part they're nothing new, we have pretty well-established tools and patterns for writing and running these kinds of tests.
+
+Unless you're really sure you know better, you'll probably want to follow roughly this strategy:
+
+* Use [`pytest`](https://docs.pytest.org/en/stable/) as your test harness
+* If you find yourself typing out long assertions, use [inline-snapshot](https://15r10nk.github.io/inline-snapshot/latest/)
+* Similarly, [dirty-equals](https://dirty-equals.helpmanual.io/latest/) can be useful for comparing large data structures
+* Use [`TestModel`][pydantic_ai.models.test.TestModel] or [`FunctionModel`][pydantic_ai.models.function.FunctionModel] in place of your actual model to avoid the cost, latency and variability of real LLM calls
+* Use [`Agent.override`][pydantic_ai.agent.Agent.override] to replace your model inside your application logic
+* Set [`ALLOW_MODEL_REQUESTS=False`][pydantic_ai.models.ALLOW_MODEL_REQUESTS] globally to block any requests from accidentally being made to non-test models
+
+### Unit testing with `TestModel`
+
+The simplest and fastest way to exercise most of your application code is to use [`TestModel`][pydantic_ai.models.test.TestModel]; by default this will call all the tools in the agent, then return either plain text or a structured response depending on the return type of the agent.
+
+!!! note "`TestModel` is not magic"
+    The "clever" (but not too clever) part of `TestModel` is that it will attempt to generate valid structured data for [function tools](agents.md#function-tools) and [result types](results.md#structured-result-validation) based on the schema of the registered tools.
+
+    There's no ML or AI in `TestModel`; it's just plain old procedural Python code that tries to generate data satisfying the JSON schema of a tool.
+
+    The resulting data won't look pretty or relevant, but it should pass Pydantic's validation in most cases.
+    If you want something more sophisticated, use [`FunctionModel`][pydantic_ai.models.function.FunctionModel] and write your own data generation logic.
+
+Let's write unit tests for the following application code:
+
+```py title="weather_app.py"
+import asyncio
+from datetime import date
+
+from pydantic_ai import Agent, CallContext
+
+from fake_database import DatabaseConn  # (1)!
+from weather_service import WeatherService  # (2)!
+
+weather_agent = Agent(
+    'openai:gpt-4o',
+    deps_type=WeatherService,
+    system_prompt='Providing a weather forecast at the locations the user provides.',
+)
+
+
+@weather_agent.tool
+def weather_forecast(
+    ctx: CallContext[WeatherService], location: str, forecast_date: date
+) -> str:
+    if forecast_date < date.today():  # (3)!
+        return ctx.deps.get_historic_weather(location, forecast_date)
+    else:
+        return ctx.deps.get_forecast(location, forecast_date)
+
+
+async def run_weather_forecast(  # (4)!
+    user_prompts: list[tuple[str, int]], conn: DatabaseConn
+):
+    """Run weather forecast for a list of user prompts and save."""
+    async with WeatherService() as weather_service:
+
+        async def run_forecast(prompt: str, user_id: int):
+            result = await weather_agent.run(prompt, deps=weather_service)
+            await conn.store_forecast(user_id, result.data)
+
+        # run all prompts in parallel
+        await asyncio.gather(
+            *(run_forecast(prompt, user_id) for (prompt, user_id) in user_prompts)
+        )
+```
+
+1. `DatabaseConn` is a class that holds a database connection
+2. `WeatherService` has methods to get weather forecasts and historic data about the weather
+3. We need to call a different endpoint depending on whether the date is in the past or the future; you'll see why this nuance is important below
+4. This function is the code we want to test, together with the agent it uses
+
+Here we have a function that takes a list of `#!python (user_prompt, user_id)` tuples, gets a weather forecast for each prompt, and stores the result in the database.
+
+**We want to test this code without having to mock certain objects or modify our code just so we can pass test objects in.**
+
+Here's how we would write tests using [`TestModel`][pydantic_ai.models.test.TestModel]:
+
+```py title="test_weather_app.py"
+from datetime import timezone
+import pytest
+
+from dirty_equals import IsNow
+
+from pydantic_ai import models
+from pydantic_ai.models.test import TestModel
+from pydantic_ai.messages import (
+    SystemPrompt,
+    UserPrompt,
+    ModelStructuredResponse,
+    ToolCall,
+    ArgsDict,
+    ToolReturn,
+    ModelTextResponse,
+)
+
+from fake_database import DatabaseConn
+from weather_app import run_weather_forecast, weather_agent
+
+pytestmark = pytest.mark.anyio  # (1)!
+models.ALLOW_MODEL_REQUESTS = False  # (2)!
+
+
+async def test_forecast():
+    conn = DatabaseConn()
+    user_id = 1
+    with weather_agent.override(model=TestModel()):  # (3)!
+        prompt = 'What will the weather be like in London on 2024-11-28?'
+        await run_weather_forecast([(prompt, user_id)], conn)  # (4)!
+
+    forecast = await conn.get_forecast(user_id)
+    assert forecast == '{"weather_forecast":"Sunny with a chance of rain"}'  # (5)!
+
+    assert weather_agent.last_run_messages == [  # (6)!
+        SystemPrompt(
+            content='Providing a weather forecast at the locations the user provides.',
+            role='system',
+        ),
+        UserPrompt(
+            content='What will the weather be like in London on 2024-11-28?',
+            timestamp=IsNow(tz=timezone.utc),  # (7)!
+            role='user',
+        ),
+        ModelStructuredResponse(
+            calls=[
+                ToolCall(
+                    tool_name='weather_forecast',
+                    args=ArgsDict(
+                        args_dict={
+                            'location': 'a',
+                            'forecast_date': '2024-01-01',  # (8)!
+                        }
+                    ),
+                    tool_id=None,
+                )
+            ],
+            timestamp=IsNow(tz=timezone.utc),
+            role='model-structured-response',
+        ),
+        ToolReturn(
+            tool_name='weather_forecast',
+            content='Sunny with a chance of rain',
+            tool_id=None,
+            timestamp=IsNow(tz=timezone.utc),
+            role='tool-return',
+        ),
+        ModelTextResponse(
+            content='{"weather_forecast":"Sunny with a chance of rain"}',
+            timestamp=IsNow(tz=timezone.utc),
+            role='model-text-response',
+        ),
+    ]
+```
+
+1. We're using [anyio](https://anyio.readthedocs.io/en/stable/) to run async tests.
+2. This is a safety measure to make sure we don't accidentally make real requests to the LLM while testing; see [`ALLOW_MODEL_REQUESTS`][pydantic_ai.models.ALLOW_MODEL_REQUESTS] for more details.
+3. We're using [`Agent.override`][pydantic_ai.agent.Agent.override] to replace the agent's model with [`TestModel`][pydantic_ai.models.test.TestModel]; the nice thing about `override` is that we can replace the model inside our application logic without needing access to the call site of the agent's `run*` methods.
+4. Now we call the function we want to test inside the `override` context manager.
+5. By default, `TestModel` will return a JSON string summarising the tool calls made and what was returned. If you wanted to customise the response to something more closely aligned with the domain, you could add [`custom_result_text='Sunny'`][pydantic_ai.models.test.TestModel.custom_result_text] when defining `TestModel`.
+6. So far we don't actually know which tools were called, or with which values; we can use the [`last_run_messages`][pydantic_ai.agent.Agent.last_run_messages] attribute to inspect messages from the most recent run and assert that the exchange between the agent and the model occurred as expected.
+7. The [`IsNow`][dirty_equals.IsNow] helper allows us to use declarative asserts even with data which will contain timestamps that change over time.
+8. `TestModel` isn't doing anything clever to extract values from the prompt, so these values are hardcoded.
+
+### Unit testing with `FunctionModel`
+
+The above tests are a great start, but careful readers will notice that `WeatherService.get_forecast` is never called, since `TestModel` calls `weather_forecast` with a date in the past.
+
+To fully exercise `weather_forecast`, we need to use [`FunctionModel`][pydantic_ai.models.function.FunctionModel] to customise how the tool is called.
+
+Here's an example of using `FunctionModel` to test the `weather_forecast` tool with custom inputs:
+
+```py title="test_weather_app2.py"
+import re
+
+import pytest
+
+from pydantic_ai import models
+from pydantic_ai.messages import (
+    Message,
+    ModelAnyResponse,
+    ModelStructuredResponse,
+    ModelTextResponse,
+    ToolCall,
+)
+from pydantic_ai.models.function import AgentInfo, FunctionModel
+
+from fake_database import DatabaseConn
+from weather_app import run_weather_forecast, weather_agent
+
+pytestmark = pytest.mark.anyio
+models.ALLOW_MODEL_REQUESTS = False
+
+
+def call_weather_forecast(  # (1)!
+    messages: list[Message], info: AgentInfo
+) -> ModelAnyResponse:
+    if len(messages) == 2:
+        # first call, call the weather forecast tool
+        assert set(info.function_tools.keys()) == {'weather_forecast'}
+
+        user_prompt = messages[1]
+        m = re.search(r'\d{4}-\d{2}-\d{2}', user_prompt.content)
+        assert m is not None
+        args = {'location': 'London', 'forecast_date': m.group()}  # (2)!
+        return ModelStructuredResponse(
+            calls=[ToolCall.from_dict('weather_forecast', args)]
+        )
+    else:
+        # second call, return the forecast
+        msg = messages[-1]
+        assert msg.role == 'tool-return'
+        return ModelTextResponse(f'The forecast is: {msg.content}')
+
+
+async def test_forecast_future():
+    conn = DatabaseConn()
+    user_id = 1
+    with weather_agent.override(model=FunctionModel(call_weather_forecast)):  # (3)!
+        prompt = 'What will the weather be like in London on 2032-01-01?'
+        await run_weather_forecast([(prompt, user_id)], conn)
+
+    forecast = await conn.get_forecast(user_id)
+    assert forecast == 'The forecast is: Rainy with a chance of sun'
+```
+
+1. We define a function `call_weather_forecast` that will be called by `FunctionModel` in place of the LLM; this function has access to the list of [`Message`][pydantic_ai.messages.Message]s that make up the run, and to [`AgentInfo`][pydantic_ai.models.function.AgentInfo], which contains information about the agent, its function tools and its return tools.
+2. Our function is slightly intelligent in that it tries to extract a date from the prompt, but it just hard-codes the location.
+3. We use [`FunctionModel`][pydantic_ai.models.function.FunctionModel] to replace the agent's model with our custom function.
+
+### Overriding model via pytest fixtures
+
+If you're writing lots of tests that all require the model to be overridden, you can use [pytest fixtures](https://docs.pytest.org/en/6.2.x/fixture.html) to override the model with [`TestModel`][pydantic_ai.models.test.TestModel] or [`FunctionModel`][pydantic_ai.models.function.FunctionModel] in a reusable way.
+
+Here's an example of a fixture that overrides the model with `TestModel`:
+
+```py title="tests.py"
+import pytest
+from weather_app import weather_agent
+
+from pydantic_ai.models.test import TestModel
+
+
+@pytest.fixture
+def override_weather_agent():
+    with weather_agent.override(model=TestModel()):
+        yield
+
+
+async def test_forecast(override_weather_agent: None):
+    ...
+    # test code here
+```
+
+## Evals
+
+"Evals" refers to evaluating the performance of an LLM when used in a specific context.
+
+Unlike unit tests, evals are an emerging art/science; anyone who tells you they know exactly how evals should be defined can safely be ignored.
+
+Evals are generally more like benchmarks than unit tests: they never "pass", although they do "fail"; you care mostly about how they change over time.
+
+### System prompt customization
+
+The system prompt is the developer's primary tool for controlling the LLM's behavior, so it's often useful to be able to customise the system prompt and see how performance changes.
+
+TODO example of customizing system prompt through deps.
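
A rough sketch of what that example might look like: carry the system prompt variant in a deps dataclass and read it from a dynamic system prompt function. The `EvalDeps` dataclass and `evaluate` helper below are illustrative assumptions, not part of the PydanticAI API.

```py
from dataclasses import dataclass

from pydantic_ai import Agent, CallContext


@dataclass
class EvalDeps:
    # the system prompt variant under evaluation (illustrative name)
    prompt_variant: str


eval_agent = Agent('openai:gpt-4o', deps_type=EvalDeps)


@eval_agent.system_prompt
def system_prompt(ctx: CallContext[EvalDeps]) -> str:
    # the system prompt comes from deps, so each eval run can supply its own variant
    return ctx.deps.prompt_variant


async def evaluate(variants: list[str], user_prompt: str) -> dict[str, str]:
    """Run the same user prompt against each system prompt variant and collect responses."""
    responses: dict[str, str] = {}
    for variant in variants:
        result = await eval_agent.run(user_prompt, deps=EvalDeps(variant))
        responses[variant] = result.data
    return responses
```

Scoring the collected responses, and tracking how those scores move as the prompt variants change, is then the eval itself.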

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -155,6 +155,7 @@ plugins:
           import:
             - url: https://docs.python.org/3/objects.inv
             - url: https://docs.pydantic.dev/latest/objects.inv
+            - url: https://dirty-equals.helpmanual.io/latest/objects.inv
             - url: https://fastapi.tiangolo.com/objects.inv
             - url: https://typing-extensions.readthedocs.io/en/latest/objects.inv
             - url: https://rich.readthedocs.io/en/stable/objects.inv
