
Commit f6b267d: Add some discussion of the logfire integration to the evals docs (#1319)

1 parent 007d9d1

File tree: 4 files changed, +54 -0 lines changed

docs/evals.md

Lines changed: 54 additions & 0 deletions
@@ -688,3 +688,57 @@ async def main():
2. Save the dataset to a JSON file; this will also write `questions_cases_schema.json` with the JSON schema for `questions_cases.json`. This time the `$schema` key is included in the JSON file to define the schema for IDEs to use while you edit the file. There's no formal spec for this, but it works in VS Code and PyCharm, and is discussed at length in [json-schema-org/json-schema-spec#828](https://github.com/json-schema-org/json-schema-spec/issues/828).

_(This example is complete, it can be run "as is" — you'll need to add `asyncio.run(main(answer))` to run `main`)_
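
To illustrate that save step, here's a minimal sketch using a made-up inline dataset in place of the generated one; only the file-writing behavior is the point:

```python
import json
from pathlib import Path

from pydantic_evals import Case, Dataset

# A trivial stand-in for the generated dataset above.
dataset = Dataset(cases=[Case(name='example', inputs='What is 2 + 2?', expected_output='4')])

dataset.to_file('questions_cases.json')  # also writes questions_cases_schema.json

# The saved JSON includes a `$schema` key referencing the generated schema file.
print('$schema' in json.loads(Path('questions_cases.json').read_text()))
#> True
```
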
## Integration with Logfire

Pydantic Evals is implemented using OpenTelemetry to record traces of the evaluation process. These traces contain all the information included in the terminal output as attributes, but also include full tracing from the executions of the evaluation task function.

You can send these traces to any OpenTelemetry-compatible backend, including [Pydantic Logfire](https://logfire.pydantic.dev/docs).
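
If you're using a different backend, one way to wire it up is with the standard OpenTelemetry SDK. The following is a minimal sketch (not taken from the Pydantic Evals docs), assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages are installed, with a placeholder endpoint:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export spans over OTLP/HTTP to any compatible collector.
# The endpoint is a placeholder; point it at your own collector.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint='http://localhost:4318/v1/traces'))
)
trace.set_tracer_provider(provider)
```
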
To send them to Logfire itself, all you need to do is configure Logfire via `logfire.configure`:
```python {title="logfire_integration.py"}
import logfire

from judge_recipes import recipe_dataset, transform_recipe

logfire.configure(
    send_to_logfire='if-token-present',  # (1)!
    environment='development',  # (2)!
    service_name='evals',  # (3)!
)

recipe_dataset.evaluate_sync(transform_recipe)
```
1. The `send_to_logfire` argument controls when traces are sent to Logfire. You can set it to `'if-token-present'` to send data to Logfire only if the `LOGFIRE_TOKEN` environment variable is set. See the [Logfire configuration docs](https://logfire.pydantic.dev/docs/reference/configuration/) for more details.
2. The `environment` argument sets the environment for the traces. When running tests or evaluations that send data to a project that also contains production data, it's a good idea to set this to `'development'` so that these traces are easy to filter out while reviewing data from your production environment(s).
3. The `service_name` argument sets the service name for the traces. This is displayed in the Logfire UI to help you identify the source of the associated spans.

Logfire has some special integration with Pydantic Evals traces, including a table view of the evaluation results on the evaluation root span (which is generated in each call to [`Dataset.evaluate`][pydantic_evals.Dataset.evaluate]):

![Logfire Evals Overview](img/logfire-evals-overview.png)

and a detailed view of the inputs and outputs for the execution of each case:

![Logfire Evals Case](img/logfire-evals-case.png)

In addition, any OpenTelemetry spans generated during the evaluation process will be sent to Logfire, allowing you to visualize the full execution of the code called during the evaluation process:

![Logfire Evals Case Trace](img/logfire-evals-case-trace.png)

This can be especially helpful when attempting to write evaluators that make use of the `span_tree` property of the [`EvaluatorContext`][pydantic_evals.evaluators.context.EvaluatorContext], as described in the [OpenTelemetry Integration](#opentelemetry-integration) section above.

This allows you to write evaluations that depend on which code paths were executed during the call to the task function, without manually instrumenting the code being evaluated, as long as that code is already adequately instrumented with OpenTelemetry. For PydanticAI agents, for example, this can be used to ensure that specific tools are (or are not) called during the execution of specific cases, as in the sketch below.
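
For example, here's a rough sketch of such an evaluator. The tool name and the span-name matching are assumptions; the exact span names depend on how the code under evaluation is instrumented:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator
from pydantic_evals.evaluators.context import EvaluatorContext
from pydantic_evals.otel.span_tree import SpanQuery


@dataclass
class ToolWasCalled(Evaluator[str, str]):
    """Check that a span for the given tool appears in the case's trace."""

    tool_name: str = 'my_tool'  # hypothetical tool name, replace with your own

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        # Match any span whose name contains the tool name; adjust the query
        # to whatever naming convention your instrumentation actually uses.
        query: SpanQuery = {'name_contains': self.tool_name}
        return ctx.span_tree.any(query)
```

Since this evaluator returns a `bool`, attaching it to a dataset treats each case as an assertion: any case whose trace never produced a matching span fails.
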
Using OpenTelemetry in this way also means that all the data used to evaluate the task executions will be accessible in the traces produced by production runs of the code, making it straightforward to perform the same evaluations on production data.
Binary files added: docs/img/logfire-evals-overview.png, docs/img/logfire-evals-case.png, docs/img/logfire-evals-case-trace.png (584 KB, 524 KB, and 445 KB)