Commit d452822

gary-huang, CFLJacquet, and OliviaShoup authored
add docs on summary evals (#31941)
* add docs on summary evals
* Update _index.md
* Update content/en/llm_observability/experiments/_index.md
* Update content/en/llm_observability/experiments/_index.md
* Update content/en/llm_observability/experiments/_index.md

---------

Co-authored-by: Charles Jacquet <[email protected]>
Co-authored-by: Olivia Shoup <[email protected]>
1 parent 5503f05 commit d452822

File tree

  • content/en/llm_observability/experiments/_index.md

1 file changed: +21 -7 lines changed


content/en/llm_observability/experiments/_index.md

Lines changed: 21 additions & 7 deletions
@@ -23,13 +23,9 @@ LLM Observability [Experiments][9] supports the entire lifecycle of building LLM
 Install Datadog's LLM Observability Python SDK:
 
 ```shell
-pip install ddtrace>=3.14.0
+pip install ddtrace>=3.15.0
 ```
 
-### Cookbooks
-
-To see in-depth examples of what you can do with LLM Experiments, you can check these [jupyter notebooks][10]
-
 ### Setup
 
 Enable LLM Observability:
@@ -221,6 +217,9 @@ Evaluators are functions that measure how well the model or agent performs by co
 - score: returns a numeric value (float)
 - categorical: returns a labeled category (string)
 
+### Summary Evaluators
+Summary Evaluators are optionally defined functions that measure how well the model or agent performs by providing an aggregated score against the entire dataset, outputs, and evaluation results. The supported evaluator types are the same as above.
+
 ### Creating an experiment
 
 1. Load a dataset
@@ -266,7 +265,17 @@ Evaluators are functions that measure how well the model or agent performs by co
     return fake_llm_call
 ```
 Evaluator functions can take any non-null type as `input_data` (string, number, Boolean, object, array); `output_data` and `expected_output` can be any type.
-Evaluators can only return a string, number, Boolean.
+Evaluators can only return a string, a number, or a Boolean.
+
+5. (Optional) Define summary evaluator function(s).
+
+```python
+def num_exact_matches(inputs, outputs, expected_outputs, evaluators_results):
+    return evaluators_results["exact_match"].count(True)
+
+```
+If defined and provided to the experiment, summary evaluator functions are executed after evaluators have finished running. Summary evaluator functions can take a list of any non-null type as `inputs` (string, number, Boolean, object, array); `outputs` and `expected_outputs` can be lists of any type. `evaluators_results` is a dictionary of lists of results from evaluators, keyed by the name of the evaluator function. For example, in the above code snippet the summary evaluator `num_exact_matches` uses the results (a list of Booleans) from the `exact_match` evaluator to provide a count of the number of exact matches.
+Summary evaluators can only return a string, a number, or a Boolean.
 
 6. Create and run the experiment.
 ```python
@@ -275,6 +284,7 @@ Evaluators are functions that measure how well the model or agent performs by co
     task=task,
     dataset=dataset,
     evaluators=[exact_match, overlap, fake_llm_as_a_judge],
+    summary_evaluators=[num_exact_matches], # optional
     description="Testing capital cities knowledge",
     config={
         "model_name": "gpt-4",
@@ -286,7 +296,7 @@ Evaluators are functions that measure how well the model or agent performs by co
 results = experiment.run() # Run on all dataset records
 
 # Process results
-for result in results:
+for result in results.get("rows", []):
     print(f"Record {result['idx']}")
     print(f"Input: {result['input']}")
     print(f"Output: {result['output']}")
@@ -352,6 +362,10 @@ jobs:
           DD_APP_KEY: ${{ secrets.DD_APP_KEY }}
 ```
 
+## Cookbooks
+
+To see in-depth examples of what you can do with LLM Experiments, you can check these [jupyter notebooks][10]
+
 ## HTTP API
 
 ### Postman quickstart
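Not part of the diff above, but as a minimal sketch of the evaluator / summary evaluator contract the added docs describe: `exact_match` mirrors the per-record evaluator name referenced in the experiment example, while `match_rate` is a hypothetical score-type summary evaluator. Only the argument names and the fact that `evaluators_results` is keyed by evaluator function name are taken from the doc text.

```python
def exact_match(input_data, output_data, expected_output):
    # Per-record evaluator (sketch): called once per dataset record,
    # returning a Boolean, one of the allowed return types.
    return output_data == expected_output


def match_rate(inputs, outputs, expected_outputs, evaluators_results):
    # Summary evaluator (sketch): runs once, after all evaluators finish,
    # over whole-dataset lists. evaluators_results is assumed to be keyed by
    # evaluator function name, so evaluators_results["exact_match"] is the
    # list of Booleans produced by the evaluator above.
    matches = evaluators_results.get("exact_match", [])
    return sum(matches) / len(matches) if matches else 0.0
```

Under that reading, a score-type summary evaluator like `match_rate` could be passed alongside `num_exact_matches` in the `summary_evaluators` list shown in the experiment setup.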

0 commit comments