Commit d452822

gary-huang, CFLJacquet, and OliviaShoup authored
add docs on summary evals (#31941)
* add docs on summary evals
* Update _index.md
* Update content/en/llm_observability/experiments/_index.md
* Update content/en/llm_observability/experiments/_index.md
* Update content/en/llm_observability/experiments/_index.md

---------

Co-authored-by: Charles Jacquet <[email protected]>
Co-authored-by: Olivia Shoup <[email protected]>
1 parent 5503f05 commit d452822

File tree

  • content/en/llm_observability/experiments/_index.md

1 file changed: +21 -7 lines changed


content/en/llm_observability/experiments/_index.md

Lines changed: 21 additions & 7 deletions
@@ -23,13 +23,9 @@ LLM Observability [Experiments][9] supports the entire lifecycle of building LLM
 Install Datadog's LLM Observability Python SDK:
 
 ```shell
-pip install ddtrace>=3.14.0
+pip install ddtrace>=3.15.0
 ```
 
-### Cookbooks
-
-To see in-depth examples of what you can do with LLM Experiments, you can check these [jupyter notebooks][10]
-
 ### Setup
 
 Enable LLM Observability:
@@ -221,6 +217,9 @@ Evaluators are functions that measure how well the model or agent performs by co
 - score: returns a numeric value (float)
 - categorical: returns a labeled category (string)
 
+### Summary Evaluators
+Summary Evaluators are optionally defined functions that measure how well the model or agent performs by providing an aggregated score against the entire dataset, outputs, and evaluation results. The supported evaluator types are the same as above.
+
 ### Creating an experiment
 
 1. Load a dataset
@@ -266,7 +265,17 @@ Evaluators are functions that measure how well the model or agent performs by co
     return fake_llm_call
 ```
 Evaluator functions can take any non-null type as `input_data` (string, number, Boolean, object, array); `output_data` and `expected_output` can be any type.
-Evaluators can only return a string, number, Boolean.
+Evaluators can only return a string, a number, or a Boolean.
+
+5. (Optional) Define summary evaluator function(s).
+
+```python
+def num_exact_matches(inputs, outputs, expected_outputs, evaluators_results):
+    return evaluators_results["exact_match"].count(True)
+
+```
+If defined and provided to the experiment, summary evaluator functions are executed after evaluators have finished running. Summary evaluator functions can take a list of any non-null type as `inputs` (string, number, Boolean, object, array); `outputs` and `expected_outputs` can be lists of any type. `evaluators_results` is a dictionary of lists of results from evaluators, keyed by the name of the evaluator function. For example, in the above code snippet the summary evaluator `num_exact_matches` uses the results (a list of Booleans) from the `exact_match` evaluator to provide a count of the number of exact matches.
+Summary evaluators can only return a string, a number, or a Boolean.
 
 6. Create and run the experiment.
 ```python
@@ -275,6 +284,7 @@ Evaluators are functions that measure how well the model or agent performs by co
     task=task,
     dataset=dataset,
     evaluators=[exact_match, overlap, fake_llm_as_a_judge],
+    summary_evaluators=[num_exact_matches], # optional
     description="Testing capital cities knowledge",
     config={
         "model_name": "gpt-4",
@@ -286,7 +296,7 @@ Evaluators are functions that measure how well the model or agent performs by co
 results = experiment.run() # Run on all dataset records
 
 # Process results
-for result in results:
+for result in results.get("rows", []):
     print(f"Record {result['idx']}")
     print(f"Input: {result['input']}")
     print(f"Output: {result['output']}")
@@ -352,6 +362,10 @@ jobs:
           DD_APP_KEY: ${{ secrets.DD_APP_KEY }}
 ```
 
+## Cookbooks
+
+To see in-depth examples of what you can do with LLM Experiments, you can check these [jupyter notebooks][10]
+
 ## HTTP API
 
 ### Postman quickstart
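Not part of the diff above, but as a minimal sketch of the evaluator / summary evaluator contract the added docs describe: `exact_match` mirrors the per-record evaluator name referenced in the experiment example, while `match_rate` is a hypothetical score-type summary evaluator. Only the argument names and the fact that `evaluators_results` is keyed by evaluator function name are taken from the doc text.

```python
def exact_match(input_data, output_data, expected_output):
    # Per-record evaluator (sketch): called once per dataset record,
    # returning a Boolean, one of the allowed return types.
    return output_data == expected_output


def match_rate(inputs, outputs, expected_outputs, evaluators_results):
    # Summary evaluator (sketch): runs once, after all evaluators finish,
    # over whole-dataset lists. evaluators_results is assumed to be keyed by
    # evaluator function name, so evaluators_results["exact_match"] is the
    # list of Booleans produced by the evaluator above.
    matches = evaluators_results.get("exact_match", [])
    return sum(matches) / len(matches) if matches else 0.0
```

Under that reading, a score-type summary evaluator like `match_rate` could be passed alongside `num_exact_matches` in the `summary_evaluators` list shown in the experiment setup.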

0 commit comments