docs/evals.md: 96 additions & 10 deletions
@@ -1,17 +1,29 @@
# Evals
_Evals_ is shorthand for both AI system _Evaluation_ as a broad topic and for specific _Evaluation Metrics_ or _Evaluators_ as individual tests. Ironically, the overloading of this term makes it difficult to evaluate what people are even talking about when they say "Evals" (without further context).
!!! danger "Warning"
    Unlike unit tests, evals are an emerging art/science; anyone who claims to know exactly how your evals should be defined can safely be ignored.
## Pydantic Evals Package
**Pydantic Evals** is a powerful evaluation framework designed to help you systematically test and evaluate the performance and accuracy of the systems you build, from augmented LLMs to multi-agent systems.
Install Pydantic Evals as part of the Pydantic AI (agent framework) package, or stand-alone.
We've designed Pydantic Evals to be useful while not being too opinionated since we (along with everyone else) are still figuring out best practices. We'd love your [feedback](help.md) on the package and how we can improve it.
!!! note "In Beta"
    Pydantic Evals support was [introduced](https://github.com/pydantic/pydantic-ai/pull/935) in v0.0.47 and is currently in beta. The API is subject to change and the documentation is incomplete.
## Code-First Evaluation
Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code, or as serialized data loaded by Python code. This differs from platforms with fully web-based configuration.
When you run an _Experiment_ you'll see a progress indicator and can print the results wherever you run your Python code (IDE, terminal, etc.). You also get a report object back that you can serialize and store, or send to a notebook or other application for further visualization and analysis.
If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as an observability layer: you write and run evals in code, then view and analyze the results in the web UI.
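As a minimal sketch of that loop (the `Case`/`Dataset` API is covered below; the task and case here are placeholders):

```python
from pydantic_evals import Case, Dataset

dataset = Dataset(cases=[Case(inputs='What is 2 + 2?', expected_output='4')])


async def task(question: str) -> str:
    return '4'  # stand-in for your real system


# Shows a progress indicator while it runs, then hands back a report object.
report = dataset.evaluate_sync(task)

# Render it wherever the code runs (IDE, terminal, CI logs)...
report.print()

# ...or keep `report` around as a plain Python object for further analysis.
```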
## Installation
To install the Pydantic Evals package, run:
@@ -27,12 +39,69 @@ use OpenTelemetry traces in your evals, or send evaluation results to [logfire](
pip/uv-add 'pydantic-evals[logfire]'
```
## Pydantic Evals Data Model
### Data Model Diagram
```
Dataset (1) ──────────── (Many) Case
    │                           │
    │                           │
    └─── (Many) Experiment ─────┴─── (Many) Case results
                │
                └─── (1) Task
                │
                └─── (Many) Evaluator
```
### Key Relationships
1. **Dataset → Cases**: One Dataset contains many Cases
2. **Dataset → Experiments**: One Dataset can be used across many Experiments over time
3. **Experiment → Case results**: One Experiment generates results by executing each Case
4. **Experiment → Task**: One Experiment evaluates one defined Task
5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators. Dataset-wide Evaluators are run against all Cases, and Case-specific Evaluators against their respective Cases, as sketched in the example below
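To make these relationships concrete, here is a hedged sketch using the built-in `IsInstance` and `EqualsExpected` evaluators (the case contents and task are illustrative):

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected, IsInstance

# One Dataset contains many Cases (1), and can be reused across many Experiments (2).
dataset = Dataset(
    cases=[
        Case(inputs='What is the capital of France?', expected_output='Paris'),
        Case(
            inputs='Name one French city.',
            expected_output='Paris',
            # A Case-specific Evaluator runs only against this Case (5).
            evaluators=(EqualsExpected(),),
        ),
    ],
    # Dataset-wide Evaluators run against every Case (5).
    evaluators=[IsInstance(type_name='str')],
)


async def task(question: str) -> str:
    return 'Paris'  # the one Task an Experiment evaluates (4)


# Each call runs one Experiment, producing one result per Case (3).
report = dataset.evaluate_sync(task)
```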
### Data Flow
1. **Dataset creation**: Define cases and evaluators in YAML/JSON, or directly in Python
2. **Experiment execution**: Run `dataset.evaluate_sync(task_function)`
3. **Cases run**: Each Case is executed against the Task
4. **Evaluation**: Evaluators score the Task outputs for each Case
5. **Results**: All Case results are collected into a summary report (see the sketch below)
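A sketch of that flow end to end. The YAML round trip assumes the dataset file helpers (`to_file`/`from_file`); the file name and values are illustrative:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected


async def answer(question: str) -> str:
    return 'Rome'  # stand-in task


# 1. Dataset creation: define cases and evaluators in Python...
dataset = Dataset(
    cases=[Case(inputs='What is the capital of Italy?', expected_output='Rome')],
    evaluators=[EqualsExpected()],
)
# ...or round-trip them through a file so they can be maintained as YAML/JSON
# (helper names assumed from the Dataset API).
dataset.to_file('capitals.yaml')
dataset = Dataset.from_file('capitals.yaml')

# 2-4. Experiment execution: each Case runs against the Task, and Evaluators score the outputs.
report = dataset.evaluate_sync(answer)

# 5. Results: all Case results are collected into a summary report.
report.print()
```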
!!! note "A metaphor"
    A useful metaphor (although not perfect) is to think of evals like a **Unit Testing** framework:

    - **Cases + Evaluators** are your individual unit tests - each one defines a specific scenario you want to test, complete with inputs and expected outcomes. Just like a unit test, a case asks: _"Given this input, does my system produce the right output?"_

    - **Datasets** are like test suites - they are the scaffolding that holds your unit tests together. They group related cases and define shared evaluation criteria that should apply across all tests in the suite.

    - **Experiments** are like running your entire test suite and getting a report. When you execute `dataset.evaluate_sync(my_ai_function)`, you're running all your cases against your AI system and collecting the results - just like running `pytest` and getting a summary of passes, failures, and performance metrics.

    The key difference from traditional unit testing is that AI systems are probabilistic. If you're type checking you'll still get a simple pass/fail, but scores for text outputs are likely qualitative and/or categorical, and more open to interpretation.
## Datasets and Cases
In Pydantic Evals, everything begins with `Dataset`s and `Case`s:
- [`Case`][pydantic_evals.Case]: A single test scenario corresponding to Task inputs. Can also optionally have a name, expected outputs, metadata, and evaluators.
- [`Dataset`][pydantic_evals.Dataset]: A collection of test Cases designed for the evaluation of a specific task or function.
```python {title="simple_eval_dataset.py"}
from pydantic_evals import Case, Dataset
@@ -51,9 +120,13 @@ _(This example is complete, it can be run "as is")_
## Evaluators
Evaluators analyze and score the results of your Task when tested against a Case.
Evaluators can be classic unit tests: deterministic, code-based checks, such as testing the model output format with a regex or checking for the appearance of PII or sensitive data. Alternatively, Evaluators can assess non-deterministic model outputs for qualities like accuracy, precision/recall, hallucination, or instruction-following.
While both kinds of testing are useful in LLM systems, classic code-based tests are cheaper and easier to run than tests that require human or machine review of model outputs. We encourage you to look for quick wins of this type when setting up a test framework for your system.
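For instance, one of those cheap, deterministic checks might be a regex over the output. This is a sketch using the custom evaluator API (`Evaluator` / `EvaluatorContext`); the class and pattern are illustrative, not built-ins:

```python
import re
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ContainsIsoDate(Evaluator):
    """Deterministic, code-based check: does the output contain an ISO-8601 date?"""

    pattern: str = r'\b\d{4}-\d{2}-\d{2}\b'

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        return re.search(self.pattern, ctx.output) is not None
```

A check like this costs nothing to run on every Case; save `LLMJudge`-style evaluators for the qualities that genuinely need model (or human) review.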
Pydantic Evals includes several built-in evaluators and allows you to define custom evaluators:
_(This example is complete, it can be run "as is")_
## Running Experiments
Performing an evaluation involves running a task against all the cases in a dataset; this is known as running an _Experiment_.
Putting the above two examples together and using the more declarative `evaluators` kwarg to [`Dataset`][pydantic_evals.Dataset]:
@@ -691,7 +764,7 @@ Pydantic Evals is implemented using OpenTelemetry to record traces of the evalua
the information included in the terminal output as attributes, but also include full tracing from the executions of the
evaluation task function.
You can send these traces to any OpenTelemetry-compatible backend, including [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/).
All you need to do is configure Logfire via `logfire.configure`:
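A minimal sketch of such a setup; the `send_to_logfire` option shown is one common choice, not the only one:

```python
import logfire

# Configure once at startup; 'if-token-present' means traces are only exported
# when a Logfire write token is available, so the same code runs fine without one.
logfire.configure(send_to_logfire='if-token-present')
```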
@@ -738,3 +811,16 @@ to ensure specific tools are (or are not) called during the execution of specifi
Using OpenTelemetry in this way also means that all data used to evaluate the task executions will be accessible in
the traces produced by production runs of the code, making it straightforward to perform the same evaluations on
production data.
## API Reference
For comprehensive coverage of all classes, methods, and configuration options, see the detailed [API Reference documentation](https://ai.pydantic.dev/api/pydantic_evals/dataset/).
## Next Steps
<!-- TODO - this would be the perfect place for a full tutorial or case study -->
1. **Start with simple evaluations** using basic evaluators like [`IsInstance`](https://ai.pydantic.dev/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.IsInstance) and [`EqualsExpected`](https://ai.pydantic.dev/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.EqualsExpected)
2. **Integrate with Logfire** to visualize results and enable team collaboration
3. **Build comprehensive test suites** with diverse cases covering edge cases and performance requirements
4. **Implement custom evaluators** for domain-specific quality metrics
5. **Automate evaluation runs** as part of your development and deployment pipeline