# Evals (beta)

View and analyze your evaluation results in Pydantic Logfire's web interface. Evals provide observability into how your AI systems perform across different test cases and experiments over time.

!!! note "Code-First Evaluation"

    Evals are created and run using the [Pydantic Evals](https://ai.pydantic.dev/evals/) package, which is developed in tandem with Pydantic AI. Logfire serves as an observability layer where you can view and compare results.

To get started, refer to the [Pydantic Evals installation guide](https://ai.pydantic.dev/evals/#installation).

## What are Evals?

Evals help you systematically test and evaluate AI systems by running them against predefined test cases. Each evaluation experiment appears in Logfire automatically when you run the `pydantic_evals.Dataset.evaluate` method with Logfire integration enabled.
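
Below is a minimal sketch of what such a run can look like, following the Pydantic Evals quickstart; the task function and case data are illustrative:

```python
import logfire
from pydantic_evals import Case, Dataset

# Send the experiment to Logfire if a write token is configured.
logfire.configure(send_to_logfire='if-token-present')

# A tiny illustrative dataset: one case with an input and an expected output.
dataset = Dataset(
    cases=[
        Case(
            name='capital_of_france',
            inputs='What is the capital of France?',
            expected_output='Paris',
        ),
    ],
)


async def answer_question(question: str) -> str:
    """Stand-in for the AI system under test."""
    return 'Paris'


# `evaluate_sync` is the synchronous wrapper around `Dataset.evaluate`.
# Running it creates an experiment that appears in the Evals tab.
report = dataset.evaluate_sync(answer_question)
report.print()
```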

For the data model, examples, and full documentation on creating and running Evals, read the [Pydantic Evals docs](https://ai.pydantic.dev/evals/).

## Viewing Experiments

The Evals tab shows all experiments for your project available within your data retention period. Each experiment represents a single run of a dataset against a task function.

### Experiment List

Each experiment displays:

- **Experiment name** - Auto-generated by Logfire (e.g., "gentle-sniff-buses")
- **Task name** - The function being evaluated
- **Span link** - Direct link to the detailed trace
- **Created timestamp** - When the experiment was run

Click on any experiment to view detailed results.

### Experiment Details

Individual experiment pages show comprehensive results, including:

- **Test cases** with inputs, expected outputs, and actual outputs
- **Assertion results** - Pass/fail status for each evaluator
- **Performance metrics** - Duration, token usage, and custom scores
- **Evaluation scores** - Detailed scoring from all evaluators (see the evaluator sketch below)
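
The pass/fail assertions and scores come from evaluators attached to the dataset. Here is a sketch of a custom evaluator following the pattern in the Pydantic Evals docs; the `ExactMatch` name and matching logic are illustrative:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ExactMatch(Evaluator):
    """Compare the task output to the expected output for each case."""

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Boolean results are shown as pass/fail assertions in Logfire;
        # returning a float instead would be shown as a score.
        return ctx.output == ctx.expected_output
```

Attach the evaluator when constructing the dataset, e.g. `Dataset(cases=[...], evaluators=[ExactMatch()])`; each result then appears as an assertion or score on the experiment page.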

## Comparing Experiments

Use the experiment comparison view to analyze performance across different runs:

1. Select multiple experiments from the list
2. Click **Compare selected**
3. View side-by-side results for the same test cases

The comparison view highlights:

- **Differences in outputs** between experiment runs
- **Score variations** across evaluators
- **Performance changes** in metrics like duration and token usage
- **Regression detection** when comparing baseline vs. current implementations (see the sketch after this list)
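
For the regression check above, a common workflow is to run the same dataset against a baseline task and a candidate task, producing two experiments you can then select and compare. A sketch, with illustrative task functions and the same `Dataset` API as above:

```python
import logfire
from pydantic_evals import Case, Dataset

logfire.configure(send_to_logfire='if-token-present')

dataset = Dataset(
    cases=[Case(name='greeting', inputs='Say hello', expected_output='Hello!')],
)


async def baseline_task(prompt: str) -> str:
    """Current production behaviour."""
    return 'Hello!'


async def candidate_task(prompt: str) -> str:
    """New implementation you want to validate."""
    return 'Hello there!'


# Each call creates a separate experiment in Logfire; select both in the
# Evals tab and click "Compare selected" to see the cases side by side.
baseline_report = dataset.evaluate_sync(baseline_task)
candidate_report = dataset.evaluate_sync(candidate_task)
```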

## Integration with Traces

Every evaluation experiment generates detailed OpenTelemetry traces that appear in Logfire:

- **Experiment span** - Root span containing all evaluation metadata
- **Case execution spans** - Individual test case runs with full context
- **Task function spans** - Detailed tracing of your AI system under test
- **Evaluator spans** - Scoring and assessment execution details

Navigate from experiment results to full trace details using the span links.
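
Spans you create inside the task function with the Logfire SDK nest under the case execution spans, so they show up when you follow a span link into the experiment's trace. A minimal sketch, with an illustrative task and span names:

```python
import logfire

logfire.configure(send_to_logfire='if-token-present')


def lookup_context(question: str) -> str:
    """Stand-in retrieval step; replace with your own logic."""
    return 'France is a country in Europe; its capital is Paris.'


@logfire.instrument('answer_question')
async def answer_question(question: str) -> str:
    # Spans opened here become children of the case execution span that
    # pydantic_evals creates while running this task.
    with logfire.span('retrieve context'):
        context = lookup_context(question)
    with logfire.span('generate answer'):
        return f'Answer derived from: {context}'
```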