docs/evals.md: 96 additions & 10 deletions
@@ -1,17 +1,29 @@
# Evals
_Evals_ is shorthand for both AI system _Evaluation_ as a broad topic and for specific _Evaluation Metrics_ or _Evaluators_ as individual tests. Ironically, the overloading of this term makes it difficult to evaluate what people are even talking about when they say "Evals" (without further context).
!!! danger "Warning"
    Unlike unit tests, evals are an emerging art/science; anyone who claims to know exactly how your evals should be defined can safely be ignored.
## Pydantic Evals Package
**Pydantic Evals** is a powerful evaluation framework designed to help you systematically test and evaluate the performance and accuracy of the systems you build, from augmented LLMs to multi-agent systems.
Install Pydantic Evals as part of the Pydantic AI (agent framework) package, or stand-alone.
We've designed Pydantic Evals to be useful while not being too opinionated since we (along with everyone else) are still figuring out best practices. We'd love your [feedback](help.md) on the package and how we can improve it.
!!! note "In Beta"
    Pydantic Evals support was [introduced](https://github.com/pydantic/pydantic-ai/pull/935) in v0.0.47 and is currently in beta. The API is subject to change and the documentation is incomplete.
## Code-First Evaluation
Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code, or as serialized data loaded by Python code. This differs from platforms with fully web-based configuration.
When you run an _Experiment_ you'll see a progress indicator and can print the results wherever you run your Python code (IDE, terminal, etc.). You also get a report object back that you can serialize and store, or send to a notebook or other application for further visualization and analysis.
If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as an observability layer: you write and run evals in code, then view and analyze the results in the web UI.
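As a minimal sketch of that loop (the `Case`/`Dataset` API is covered below; the task and case here are placeholders):

```python
from pydantic_evals import Case, Dataset

dataset = Dataset(cases=[Case(inputs='What is 2 + 2?', expected_output='4')])


async def task(question: str) -> str:
    return '4'  # stand-in for your real system


# Shows a progress indicator while it runs, then hands back a report object.
report = dataset.evaluate_sync(task)

# Render it wherever the code runs (IDE, terminal, CI logs)...
report.print()

# ...or keep `report` around as a plain Python object for further analysis.
```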
## Installation
To install the Pydantic Evals package, run:
@@ -27,12 +39,69 @@ use OpenTelemetry traces in your evals, or send evaluation results to [logfire](
pip/uv-add 'pydantic-evals[logfire]'
```
## Pydantic Evals Data Model
### Data Model Diagram
```
Dataset (1) ──────────── (Many) Case
    │                           │
    │                           │
    └─── (Many) Experiment ─────┴─── (Many) Case results
                │
                └─── (1) Task
                │
                └─── (Many) Evaluator
```
### Key Relationships
1. **Dataset → Cases**: One Dataset contains many Cases
2. **Dataset → Experiments**: One Dataset can be used across many Experiments over time
3. **Experiment → Case results**: One Experiment generates results by executing each Case
4. **Experiment → Task**: One Experiment evaluates one defined Task
5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators. Dataset-wide Evaluators are run against all Cases, and Case-specific Evaluators against their respective Cases, as sketched in the example below
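To make these relationships concrete, here is a hedged sketch using the built-in `IsInstance` and `EqualsExpected` evaluators (the case contents and task are illustrative):

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected, IsInstance

# One Dataset contains many Cases (1), and can be reused across many Experiments (2).
dataset = Dataset(
    cases=[
        Case(inputs='What is the capital of France?', expected_output='Paris'),
        Case(
            inputs='Name one French city.',
            expected_output='Paris',
            # A Case-specific Evaluator runs only against this Case (5).
            evaluators=(EqualsExpected(),),
        ),
    ],
    # Dataset-wide Evaluators run against every Case (5).
    evaluators=[IsInstance(type_name='str')],
)


async def task(question: str) -> str:
    return 'Paris'  # the one Task an Experiment evaluates (4)


# Each call runs one Experiment, producing one result per Case (3).
report = dataset.evaluate_sync(task)
```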
### Data Flow
1. **Dataset creation**: Define cases and evaluators in YAML/JSON, or directly in Python
2. **Experiment execution**: Run `dataset.evaluate_sync(task_function)`
3. **Cases run**: Each Case is executed against the Task
4. **Evaluation**: Evaluators score the Task outputs for each Case
5. **Results**: All Case results are collected into a summary report (see the sketch below)
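A sketch of that flow end to end. The YAML round trip assumes the dataset file helpers (`to_file`/`from_file`); the file name and values are illustrative:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected


async def answer(question: str) -> str:
    return 'Rome'  # stand-in task


# 1. Dataset creation: define cases and evaluators in Python...
dataset = Dataset(
    cases=[Case(inputs='What is the capital of Italy?', expected_output='Rome')],
    evaluators=[EqualsExpected()],
)
# ...or round-trip them through a file so they can be maintained as YAML/JSON
# (helper names assumed from the Dataset API).
dataset.to_file('capitals.yaml')
dataset = Dataset.from_file('capitals.yaml')

# 2-4. Experiment execution: each Case runs against the Task, and Evaluators score the outputs.
report = dataset.evaluate_sync(answer)

# 5. Results: all Case results are collected into a summary report.
report.print()
```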
!!! note "A metaphor"
    A useful metaphor (although not perfect) is to think of evals like a **Unit Testing** framework:

    - **Cases + Evaluators** are your individual unit tests - each one defines a specific scenario you want to test, complete with inputs and expected outcomes. Just like a unit test, a case asks: _"Given this input, does my system produce the right output?"_

    - **Datasets** are like test suites - they are the scaffolding that holds your unit tests together. They group related cases and define shared evaluation criteria that should apply across all tests in the suite.

    - **Experiments** are like running your entire test suite and getting a report. When you execute `dataset.evaluate_sync(my_ai_function)`, you're running all your cases against your AI system and collecting the results - just like running `pytest` and getting a summary of passes, failures, and performance metrics.

    The key difference from traditional unit testing is that AI systems are probabilistic. If you're type checking you'll still get a simple pass/fail, but scores for text outputs are likely qualitative and/or categorical, and more open to interpretation.
## Datasets and Cases
In Pydantic Evals, everything begins with `Dataset`s and `Case`s:
- [`Case`][pydantic_evals.Case]: A single test scenario corresponding to Task inputs. Can also optionally have a name, expected outputs, metadata, and evaluators.
- [`Dataset`][pydantic_evals.Dataset]: A collection of test Cases designed for the evaluation of a specific task or function.
```python {title="simple_eval_dataset.py"}
from pydantic_evals import Case, Dataset
@@ -51,9 +120,13 @@ _(This example is complete, it can be run "as is")_
## Evaluators
Evaluators analyze and score the results of your Task when tested against a Case.
Evaluators can be classic unit tests: deterministic, code-based checks, such as testing the model output format with a regex or checking for the appearance of PII or sensitive data. Alternatively, Evaluators can assess non-deterministic model outputs for qualities like accuracy, precision/recall, hallucination, or instruction-following.
While both kinds of testing are useful in LLM systems, classic code-based tests are cheaper and easier to run than tests that require human or machine review of model outputs. We encourage you to look for quick wins of this type when setting up a test framework for your system.
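For instance, one of those cheap, deterministic checks might be a regex over the output. This is a sketch using the custom evaluator API (`Evaluator` / `EvaluatorContext`); the class and pattern are illustrative, not built-ins:

```python
import re
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ContainsIsoDate(Evaluator):
    """Deterministic, code-based check: does the output contain an ISO-8601 date?"""

    pattern: str = r'\b\d{4}-\d{2}-\d{2}\b'

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        return re.search(self.pattern, ctx.output) is not None
```

A check like this costs nothing to run on every Case; save `LLMJudge`-style evaluators for the qualities that genuinely need model (or human) review.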
Pydantic Evals includes several built-in evaluators and allows you to define custom evaluators:
_(This example is complete, it can be run "as is")_
## Running Experiments
Performing an evaluation involves running a task against all the cases in a dataset; this is known as running an _Experiment_.
Putting the above two examples together and using the more declarative `evaluators` kwarg to [`Dataset`][pydantic_evals.Dataset]:
@@ -691,7 +764,7 @@ Pydantic Evals is implemented using OpenTelemetry to record traces of the evalua
the information included in the terminal output as attributes, but also include full tracing from the executions of the
evaluation task function.
You can send these traces to any OpenTelemetry-compatible backend, including [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/).
All you need to do is configure Logfire via `logfire.configure`:
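A minimal sketch of such a setup; the `send_to_logfire` option shown is one common choice, not the only one:

```python
import logfire

# Configure once at startup; 'if-token-present' means traces are only exported
# when a Logfire write token is available, so the same code runs fine without one.
logfire.configure(send_to_logfire='if-token-present')
```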
@@ -738,3 +811,16 @@ to ensure specific tools are (or are not) called during the execution of specifi
Using OpenTelemetry in this way also means that all data used to evaluate the task executions will be accessible in
the traces produced by production runs of the code, making it straightforward to perform the same evaluations on
production data.
## API Reference
For comprehensive coverage of all classes, methods, and configuration options, see the detailed [API Reference documentation](https://ai.pydantic.dev/api/pydantic_evals/dataset/).
## Next Steps
<!-- TODO - this would be the perfect place for a full tutorial or case study -->
1. **Start with simple evaluations** using basic evaluators like [`IsInstance`](https://ai.pydantic.dev/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.IsInstance) and [`EqualsExpected`](https://ai.pydantic.dev/api/pydantic_evals/evaluators/#pydantic_evals.evaluators.EqualsExpected)
2. **Integrate with Logfire** to visualize results and enable team collaboration
3. **Build comprehensive test suites** with diverse cases covering edge cases and performance requirements
4. **Implement custom evaluators** for domain-specific quality metrics
5. **Automate evaluation runs** as part of your development and deployment pipeline