Commit ebbf315

Gen AI Evaluation Result (#2563)
1 parent 5666960 commit ebbf315

5 files changed: +147 -28 lines changed
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+change_type: enhancement
+component: gen-ai
+note: |
+  Introducing `Evaluation Event` in GenAI Semantic Conventions to represent and capture evaluation results.
+
+issues: [2563]

docs/gen-ai/gen-ai-events.md

Lines changed: 48 additions & 0 deletions
@@ -9,6 +9,7 @@ linkTitle: Events
<!-- toc -->

- [Event: `event.gen_ai.client.inference.operation.details`](#event-eventgen_aiclientinferenceoperationdetails)
+- [Event: `event.gen_ai.evaluation.result`](#event-eventgen_aievaluationresult)

<!-- tocstop -->

@@ -209,4 +210,51 @@ section for more details.
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->

+## Event: `event.gen_ai.evaluation.result`
+
+<!-- semconv event.gen_ai.evaluation.result -->
+<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
+<!-- see templates/registry/markdown/snippet.md.j2 -->
+<!-- prettier-ignore-start -->
+<!-- markdownlint-capture -->
+<!-- markdownlint-disable -->
+
+**Status:** ![Development](https://img.shields.io/badge/-development-blue)
+
+The event name MUST be `gen_ai.evaluation.result`.
+
+This event captures the result of evaluating GenAI output for quality, accuracy, or other characteristics. This event SHOULD be parented to the GenAI operation span being evaluated when possible, or SHOULD set `gen_ai.response.id` when the span id is not available.
+
+| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
+|---|---|---|---|---|---|
+| [`gen_ai.evaluation.name`](/docs/registry/attributes/gen-ai.md) | string | The name of the evaluation metric used for the GenAI response. | `Relevance`; `IntentResolution` | `Required` | ![Development](https://img.shields.io/badge/-development-blue) |
+| [`error.type`](/docs/registry/attributes/error.md) | string | Describes a class of error the operation ended with. [1] | `timeout`; `java.net.UnknownHostException`; `server_certificate_invalid`; `500` | `Conditionally Required` if the operation ended in an error | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
+| [`gen_ai.evaluation.score.label`](/docs/registry/attributes/gen-ai.md) | string | Human-readable label for the evaluation. [2] | `relevant`; `not_relevant`; `correct`; `incorrect`; `pass`; `fail` | `Conditionally Required` if applicable | ![Development](https://img.shields.io/badge/-development-blue) |
+| [`gen_ai.evaluation.score.value`](/docs/registry/attributes/gen-ai.md) | double | The evaluation score returned by the evaluator. | `4.0` | `Conditionally Required` if applicable | ![Development](https://img.shields.io/badge/-development-blue) |
+| [`gen_ai.evaluation.explanation`](/docs/registry/attributes/gen-ai.md) | string | A free-form explanation for the assigned score provided by the evaluator. | `The response is factually accurate but lacks sufficient detail to fully address the question.` | `Recommended` | ![Development](https://img.shields.io/badge/-development-blue) |
+| [`gen_ai.response.id`](/docs/registry/attributes/gen-ai.md) | string | The unique identifier for the completion. [3] | `chatcmpl-123` | `Recommended` when available | ![Development](https://img.shields.io/badge/-development-blue) |
+
+**[1] `error.type`:** The `error.type` SHOULD match the error code returned by the Generative AI Evaluation provider or the client library, the canonical name of the exception that occurred, or another low-cardinality error identifier. Instrumentations SHOULD document the list of errors they report.
+
+**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator. For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values.
+
+**[3] `gen_ai.response.id`:** The unique identifier assigned to the specific completion being evaluated. This attribute helps correlate the evaluation event with the corresponding operation when the span id is not available.
+
+---
+
+`error.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.
+
+| Value | Description | Stability |
+|---|---|---|
+| `_OTHER` | A fallback error value to be used when the instrumentation doesn't define a custom value. | ![Stable](https://img.shields.io/badge/-stable-lightgreen) |
+
+<!-- markdownlint-restore -->
+<!-- prettier-ignore-end -->
+<!-- END AUTOGENERATED TEXT -->
+<!-- endsemconv -->
+
[DocumentStatus]: https://opentelemetry.io/docs/specs/otel/document-status
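For context, below is a minimal sketch (not part of this commit) of how an instrumentation might emit the new `gen_ai.evaluation.result` event using the experimental `opentelemetry._events` API from recent opentelemetry-python releases. The instrumentation name, helper function, and evaluator values are illustrative; only the event name and attribute keys come from the definition above, and the API is underscore-prefixed because it is not yet stable.

```python
# Illustrative sketch only: emit a gen_ai.evaluation.result event via the
# experimental opentelemetry._events API. Attribute keys mirror the table
# above; the instrumentation name and values are made up. Without an SDK
# (EventLoggerProvider) configured, emit() is a no-op.
from opentelemetry._events import Event, get_event_logger

event_logger = get_event_logger("example.genai.evaluator")  # hypothetical name


def report_evaluation(response_id: str, score: float, label: str, explanation: str) -> None:
    """Record one evaluation result for a GenAI completion."""
    event_logger.emit(
        Event(
            name="gen_ai.evaluation.result",
            attributes={
                "gen_ai.evaluation.name": "Relevance",
                "gen_ai.evaluation.score.value": score,
                "gen_ai.evaluation.score.label": label,
                "gen_ai.evaluation.explanation": explanation,
                # Correlates the event with the completion when it is not
                # emitted under the span of the operation being evaluated.
                "gen_ai.response.id": response_id,
            },
        )
    )


report_evaluation(
    "chatcmpl-123",
    4.0,
    "relevant",
    "The response is factually accurate but lacks sufficient detail.",
)
```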
