---
title: How to read experiment results locally
sidebarTitle: Read experiment results locally
---

When running [evaluations](/langsmith/evaluation-concepts), you may want to process results programmatically in your script rather than viewing them in the [LangSmith UI](https://smith.langchain.com). This is useful for scenarios like:

- **CI/CD pipelines**: Implement quality gates that fail builds if evaluation scores drop below a threshold.
- **Local debugging**: Inspect and analyze results in your script without extra API calls to fetch them back.
- **Custom aggregations**: Calculate metrics and statistics using your own logic.
- **Integration testing**: Use evaluation results to gate merges or deployments.

This guide shows you how to iterate over and process [experiment](/langsmith/evaluation-concepts#experiment) results from the @[`ExperimentResults`][ExperimentResults] object returned by @[`Client.evaluate()`][Client.evaluate].

<Note>
This page focuses on processing results programmatically while still uploading them to LangSmith.

If you want to run evaluations locally **without** recording anything to LangSmith (for quick testing or validation), refer to [Run an evaluation locally](/langsmith/local), which uses `upload_results=False`.
</Note>

## Iterate over evaluation results

The @[`evaluate()`][Client.evaluate] function returns an @[`ExperimentResults`][ExperimentResults] object that you can iterate over. The `blocking` parameter controls when results become available:

- `blocking=False`: Returns immediately with an iterator that yields results as they're produced. This allows you to process results in real time as the evaluation runs.
- `blocking=True` (default): Blocks until all evaluations complete before returning. When you iterate over the results, all data is already available.

Both modes return the same `ExperimentResults` type; the difference is whether the function waits for completion before returning. Use `blocking=False` for streaming and real-time debugging, or `blocking=True` for batch processing when you need the complete dataset.

The following example demonstrates `blocking=False`. It iterates over results as they stream in, collects them in a list, then processes them in a separate loop:

```python
from langsmith import Client
import random

client = Client()

def target(inputs):
    """Your application or LLM chain"""
    return {"output": "MY OUTPUT"}

def evaluator(run, example):
    """Your evaluator function"""
    return {"key": "randomness", "score": random.randint(0, 1)}

# Run evaluation with blocking=False to get an iterator
streamed_results = client.evaluate(
    target,
    data="MY_DATASET_NAME",
    evaluators=[evaluator],
    blocking=False
)

# Collect results as they stream in
aggregated_results = []
for result in streamed_results:
    aggregated_results.append(result)

# Process in a separate loop so printed output doesn't interleave with evaluate()'s progress logs
for result in aggregated_results:
    print("Input:", result["run"].inputs)
    print("Output:", result["run"].outputs)
    print("Evaluation Results:", result["evaluation_results"]["results"])
    print("--------------------------------")
```

This produces output like:

```
Input: {'input': 'MY INPUT'}
Output: {'output': 'MY OUTPUT'}
Evaluation Results: [EvaluationResult(key='randomness', score=1, value=None, comment=None, correction=None, evaluator_info={}, feedback_config=None, source_run_id=UUID('7ebb4900-91c0-40b0-bb10-f2f6a451fd3c'), target_run_id=None, extra=None)]
--------------------------------
```

## Understand the result structure

Each result in the iterator contains:

- `result["run"]`: The execution of your target function.
  - `result["run"].inputs`: The inputs from your [dataset](/langsmith/evaluation-concepts#datasets) example.
  - `result["run"].outputs`: The outputs produced by your target function.
  - `result["run"].id`: The unique ID for this run.

- `result["evaluation_results"]["results"]`: A list of `EvaluationResult` objects, one per evaluator.
  - `key`: The metric name (from your evaluator's return value).
  - `score`: The numeric score (typically 0-1 or boolean).
  - `comment`: Optional explanatory text.
  - `source_run_id`: The ID of the evaluator run.

- `result["example"]`: The dataset example that was evaluated.
  - `result["example"].inputs`: The input values.
  - `result["example"].outputs`: The reference outputs (if any).
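
As a quick illustration of these fields, here is a minimal sketch that flattens each result into a plain dictionary you could log or write to a file. It assumes `results` is an `ExperimentResults` object returned by `client.evaluate()`, as in the example above:

```python
# Minimal sketch: flatten each result into a plain dictionary.
# Assumes `results` was returned by client.evaluate() and has not been consumed yet.
flattened = []
for result in results:
    flattened.append(
        {
            "run_id": str(result["run"].id),
            "inputs": result["example"].inputs,
            "outputs": result["run"].outputs,
            "reference_outputs": result["example"].outputs,
            "scores": {
                er.key: er.score
                for er in result["evaluation_results"]["results"]
            },
        }
    )

print(flattened[0])
```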

## Examples

### Implement a quality gate

This example uses evaluation results to pass or fail a CI/CD build automatically based on quality thresholds. The script iterates through results, calculates an average accuracy score, and exits with a non-zero status code if the accuracy falls below 85%. This ensures that only code changes that meet quality standards get deployed.

```python
from langsmith import Client
import sys

client = Client()

def my_application(inputs):
    # Your application logic
    return {"response": "..."}

def accuracy_evaluator(run, example):
    # Your evaluation logic
    is_correct = run.outputs["response"] == example.outputs["expected"]
    return {"key": "accuracy", "score": 1 if is_correct else 0}

# Run evaluation
results = client.evaluate(
    my_application,
    data="my_test_dataset",
    evaluators=[accuracy_evaluator],
    blocking=False
)

# Calculate aggregate metrics
total_score = 0
count = 0

for result in results:
    eval_result = result["evaluation_results"]["results"][0]
    total_score += eval_result.score
    count += 1

average_accuracy = total_score / count

print(f"Average accuracy: {average_accuracy:.2%}")

# Fail the build if accuracy is too low
if average_accuracy < 0.85:
    print("❌ Evaluation failed! Accuracy below 85% threshold.")
    sys.exit(1)

print("✅ Evaluation passed!")
```

### Batch processing with `blocking=True`

When you need to perform operations that require the complete dataset (like calculating percentiles, sorting by score, or generating summary reports), use `blocking=True` to wait for all evaluations to complete before processing:

```python
# Run evaluation and wait for all results.
# Assumes client, target, and evaluator are defined as in the first example.
results = client.evaluate(
    target,
    data="MY_DATASET_NAME",
    evaluators=[evaluator],
    blocking=True  # Wait for all evaluations to complete
)

# Process all results after evaluation completes
for result in results:
    print("Input:", result["run"].inputs)
    print("Output:", result["run"].outputs)

    # Access individual evaluation results
    for eval_result in result["evaluation_results"]["results"]:
        print(f"  {eval_result.key}: {eval_result.score}")
```

With `blocking=True`, your processing code runs only after all evaluations are complete, avoiding mixed output with evaluation logs.
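
With the complete result set in hand, you can also compute aggregates across multiple evaluators. The following is a rough sketch (one possible approach, not the only one) that averages scores per metric key; it assumes `results` comes from the `client.evaluate()` call above and has not been consumed by another loop yet:

```python
from collections import defaultdict

# Collect once into a list so we don't rely on re-iterating ExperimentResults
all_results = list(results)

# Accumulate scores per metric key across all examples
totals = defaultdict(float)
counts = defaultdict(int)

for result in all_results:
    for eval_result in result["evaluation_results"]["results"]:
        if eval_result.score is not None:
            totals[eval_result.key] += eval_result.score
            counts[eval_result.key] += 1

# Report the average score for each metric
for key, total in totals.items():
    print(f"{key}: {total / counts[key]:.2f} (n={counts[key]})")
```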

For more information on running evaluations without uploading results, refer to [Run an evaluation locally](/langsmith/local).

## Related

- [Evaluate your LLM application](/langsmith/evaluate-llm-application)
- [Run an evaluation locally](/langsmith/local)
- [Fetch performance metrics from an experiment](/langsmith/fetch-perf-metrics-experiment)