"Evals" refers to evaluating a models performance for a specific application.
270
270
271
-
Unlike unit tests, evals are an emerging art/science, anyone who tells you they know exactly how evals should be defined can safely be ignored.
271
+
!!! danger "Warning"
    Unlike unit tests, evals are an emerging art/science; anyone who claims to know for sure exactly how your evals should be defined can safely be ignored.

Evals are generally more like benchmarks than unit tests: they never "pass", although they do "fail"; you care mostly about how they change over time.

Since evals need to be run against the real model, they can be slow and expensive to run, so you generally won't want to run them in CI for every commit.
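One pragmatic pattern is to keep evals in the test suite but gate them behind an explicit opt-in, e.g. an environment variable checked by a pytest marker. Here's a minimal sketch, where the `RUN_EVALS` variable and file name are chosen purely for illustration:

```py title="test_evals_gate.py"
import os

import pytest

# only run eval tests when explicitly opted in, e.g. in a nightly job that sets RUN_EVALS=1
run_evals = pytest.mark.skipif(
    os.environ.get('RUN_EVALS') != '1',
    reason='evals are slow and expensive, set RUN_EVALS=1 to run them',
)


@run_evals
def test_sql_generation_eval():
    ...  # run the eval against the real model here
```

A scheduled (e.g. nightly) job can then set the variable, while ordinary CI runs skip these tests.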
### Measuring performance
The hardest part of evals is measuring how well the model has performed.

In some cases (e.g. an agent to generate SQL) there are simple, easy to run tests that can be used to measure performance (e.g. is the SQL valid? Does it return the right results? Does it return just the right results?).

In other cases (e.g. an agent that gives advice on quitting smoking) it can be very hard or impossible to make quantitative measures of performance — in the smoking case you'd really need to run a double-blind trial over months, then wait 40 years and observe health outcomes to know if changes to your prompt were an improvement.

There are a few different strategies you can use to measure performance:

* **End to end, self-contained tests** — like the SQL example, we can test the final result of the agent near-instantly
* **Synthetic self-contained tests** — writing unit-test-style checks that the output is as expected, e.g. `#!python 'chewing gum' in response`; while these checks might seem simplistic they can be helpful, and one nice characteristic is that it's easy to tell what's wrong when they fail (see the sketch after this list)
* **LLMs evaluating LLMs** — using another model, or even the same model with a different prompt, to evaluate the performance of the agent (like when the class marks each other's homework because the teacher has a hangover); while the downsides and complexities of this approach are obvious, some think it can be a useful tool in the right circumstances
* **Evals in prod** — measuring the end results of the agent in production, then creating a quantitative measure of performance, so you can easily measure changes over time as you change the prompt or model used; [logfire](logfire.md) can be extremely useful here since you can write a custom query to measure the performance of your agent
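As a rough sketch of the second strategy above, a synthetic self-contained check might look like the test below; `advice_app` and `advice_agent` are hypothetical names for the smoking-advice agent mentioned earlier, not part of any example app in these docs:

```py title="test_advice_agent.py"
import pytest

from advice_app import advice_agent  # hypothetical agent that gives advice on quitting smoking

pytestmark = pytest.mark.anyio  # run the async test via the anyio pytest plugin


async def test_mentions_nicotine_gum():
    result = await advice_agent.run('What can I do when I get a craving?')
    # a crude substring check, but when it fails it's obvious what's missing
    assert 'chewing gum' in result.data
```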
### System prompt customization
The system prompt is the developer's primary tool in controlling an agent's behavior, so it's often useful to be able to customise the system prompt and see how performance changes. This is particularly relevant when the system prompt contains a list of examples and you want to understand how changing that list affects the model's performance.

Let's assume we have the following app for running SQL generated from a user prompt (this example omits a lot of details for brevity, see the [SQL gen](examples/sql-gen.md) example for more complete code):
"""Search the database based on the user's prompts."""
358
+
...# (4)!
359
+
result =await sql_agent.run(user_prompt, deps=SqlSystemPrompt())
360
+
conn = DatabaseConn()
361
+
returnawait conn.execute(result.data)
362
+
```
`examples.json` looks something like this:

    request: show me error records with the tag "foobar"
    response: SELECT * FROM records WHERE level = 'error' and 'foobar' = ANY(tags)

```json title="examples.json"
{
  "examples": [
    {
      "request": "Show me all records",
      "sql": "SELECT * FROM records;"
    },
    {
      "request": "Show me all records from 2021",
      "sql": "SELECT * FROM records WHERE date_trunc('year', date) = '2021-01-01';"
    },
    {
      "request": "show me error records with the tag 'foobar'",
      "sql": "SELECT * FROM records WHERE level = 'error' and 'foobar' = ANY(tags);"
    },
    ...
  ]
}
```
Now we want a way to quantify the success of the SQL generation so we can judge how changes to the agent affect its performance.

We can use [`Agent.override`][pydantic_ai.agent.Agent.override] to replace the system prompt with a custom one that uses a subset of examples, and then run the application code (in this case `user_search`). We also run the actual SQL from the examples and compare the "correct" result from the example SQL to the result of the SQL generated by the agent. (We compare the results of running the SQL rather than the SQL itself, since the SQL might be semantically equivalent but written in a different way.)
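In outline, that looks something like the sketch below; the `examples` argument to `SqlSystemPrompt` is assumed here for illustration, and the full cross-validation version follows:

```py
from sql_app import sql_agent, SqlSystemPrompt, user_search


async def run_with_examples(subset: list[dict[str, str]], request: str):
    # assumes SqlSystemPrompt can be constructed with the examples to embed in the system prompt
    with sql_agent.override(deps=SqlSystemPrompt(examples=subset)):
        return await user_search(request)
```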
To get a quantitative measure of performance, we assign points to each run as follows (a worked example follows the list):

* **-100** points if the generated SQL is invalid
* **-1** point for each row returned by the agent (so returning lots of results is discouraged)
* **+5** points for each row returned by the agent that matches the expected result
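For example, a hypothetical run where the generated query is valid, returns 12 rows, and 10 of those rows appear in the expected result would score:

```py
score = 0
score -= 1 * 12  # -1 point per returned row
score += 5 * 10  # +5 points per row that matches the expected result
assert score == 38
```

An invalid query would instead score -100 for that run.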
We use 5-fold cross-validation to judge the performance of the agent using our existing set of examples.
```py title="test_sql_app.py"
import json
import statistics
from pathlib import Path
from itertools import chain

from fake_database import DatabaseConn, QueryError
from sql_app import sql_agent, SqlSystemPrompt, user_search


async def main():
    with Path('examples.json').open('rb') as f:
        # the JSON file wraps the list of examples in an "examples" key
        examples = json.load(f)['examples']

    # split examples into 5 folds
    fold_size = len(examples) // 5
    folds = [examples[i : i + fold_size] for i in range(0, len(examples), fold_size)]
    conn = DatabaseConn()
    scores = []

    for i, fold in enumerate(folds, start=1):
        fold_score = 0
        # build all other folds into a list of examples
        # (start=1 here too so j lines up with i and the current fold is excluded)
        other_folds = list(chain(*(f for j, f in enumerate(folds, start=1) if j != i)))
        # create a new system prompt with the other fold examples