diff --git a/.chloggen/users_singankit_gen_ai_evaluation_result_event.yaml b/.chloggen/users_singankit_gen_ai_evaluation_result_event.yaml new file mode 100644 index 0000000000..f74495ff81 --- /dev/null +++ b/.chloggen/users_singankit_gen_ai_evaluation_result_event.yaml @@ -0,0 +1,6 @@ +change_type: enhancement +component: gen-ai +note: | + Introducing `Evaluation Event` in GenAI Semantic Conventions to represent and capture evaluation results. + +issues: [2563] diff --git a/docs/gen-ai/gen-ai-events.md b/docs/gen-ai/gen-ai-events.md index dc57b5ac48..10eb8730c4 100644 --- a/docs/gen-ai/gen-ai-events.md +++ b/docs/gen-ai/gen-ai-events.md @@ -9,6 +9,7 @@ linkTitle: Events - [Event: `event.gen_ai.client.inference.operation.details`](#event-eventgen_aiclientinferenceoperationdetails) +- [Event: `event.gen_ai.evaluation.result`](#event-eventgen_aievaluationresult) @@ -209,4 +210,51 @@ section for more details. +## Event: `event.gen_ai.evaluation.result` + + + + + + + + +**Status:** ![Development](https://img.shields.io/badge/-development-blue) + +The event name MUST be `gen_ai.evaluation.result`. + +This event captures the result of evaluating GenAI output for quality, accuracy, or other characteristics. This event SHOULD be parented to GenAI operation span being evaluated when possible or set `gen_ai.response.id` when span id is not available. + +| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability | +|---|---|---|---|---|---| +| [`gen_ai.evaluation.name`](/docs/registry/attributes/gen-ai.md) | string | The name of the evaluation metric used for the GenAI response. | `Relevance`; `IntentResolution` | `Required` | ![Development](https://img.shields.io/badge/-development-blue) | +| [`error.type`](/docs/registry/attributes/error.md) | string | Describes a class of error the operation ended with. 
[1] | `timeout`; `java.net.UnknownHostException`; `server_certificate_invalid`; `500` | `Conditionally Required` if the operation ended in an error | ![Stable](https://img.shields.io/badge/-stable-lightgreen) | +| [`gen_ai.evaluation.score.label`](/docs/registry/attributes/gen-ai.md) | string | Human readable label for evaluation. [2] | `relevant`; `not_relevant`; `correct`; `incorrect`; `pass`; `fail` | `Conditionally Required` if applicable | ![Development](https://img.shields.io/badge/-development-blue) | +| [`gen_ai.evaluation.score.value`](/docs/registry/attributes/gen-ai.md) | double | The evaluation score returned by the evaluator. | `4.0` | `Conditionally Required` if applicable | ![Development](https://img.shields.io/badge/-development-blue) | +| [`gen_ai.evaluation.explanation`](/docs/registry/attributes/gen-ai.md) | string | A free-form explanation for the assigned score provided by the evaluator. | `The response is factually accurate but lacks sufficient detail to fully address the question.` | `Recommended` | ![Development](https://img.shields.io/badge/-development-blue) | +| [`gen_ai.response.id`](/docs/registry/attributes/gen-ai.md) | string | The unique identifier for the completion. [3] | `chatcmpl-123` | `Recommended` when available | ![Development](https://img.shields.io/badge/-development-blue) | + +**[1] `error.type`:** The `error.type` SHOULD match the error code returned by the Generative AI Evaluation provider or the client library, +the canonical name of exception that occurred, or another low-cardinality error identifier. +Instrumentations SHOULD document the list of errors they report. + +**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator. For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator. The label SHOULD have low cardinality. 
Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values. + +**[3] `gen_ai.response.id`:** The unique identifier assigned to the specific +completion being evaluated. This attribute helps correlate the evaluation +event with the corresponding operation when span id is not available. + +--- + +`error.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used. + +| Value | Description | Stability | +|---|---|---| +| `_OTHER` | A fallback error value to be used when the instrumentation doesn't define a custom value. | ![Stable](https://img.shields.io/badge/-stable-lightgreen) | + + + + + + [DocumentStatus]: https://opentelemetry.io/docs/specs/otel/document-status diff --git a/docs/registry/attributes/gen-ai.md b/docs/registry/attributes/gen-ai.md index b62a376cb9..f0be533342 100644 --- a/docs/registry/attributes/gen-ai.md +++ b/docs/registry/attributes/gen-ai.md @@ -19,13 +19,17 @@ This document defines the attributes used to describe telemetry in the context o | `gen_ai.conversation.id` | string | The unique identifier for a conversation (session, thread), used to store and correlate messages within this conversation. | `conv_5j66UpCpwteGg4YSxUnt7lPY` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.data_source.id` | string | The data source identifier. [1] | `H7STPQYOND` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.embeddings.dimension.count` | int | The number of dimensions the resulting output embeddings should have. | `512`; `1024` | ![Development](https://img.shields.io/badge/-development-blue) | -| `gen_ai.input.messages` | any | The chat history provided to the model as an input. [2] | [
  {
    "role": "user",
    "parts": [
      {
        "type": "text",
        "content": "Weather in Paris?"
      }
    ]
  },
  {
    "role": "assistant",
    "parts": [
      {
        "type": "tool_call",
        "id": "call_VSPygqKTWdrhaFErNvMV18Yl",
        "name": "get_weather",
        "arguments": {
          "location": "Paris"
        }
      }
    ]
  },
  {
    "role": "tool",
    "parts": [
      {
        "type": "tool_call_response",
        "id": " call_VSPygqKTWdrhaFErNvMV18Yl",
        "result": "rainy, 57°F"
      }
    ]
  }
] | ![Development](https://img.shields.io/badge/-development-blue) | -| `gen_ai.operation.name` | string | The name of the operation being performed. [3] | `chat`; `generate_content`; `text_completion` | ![Development](https://img.shields.io/badge/-development-blue) | -| `gen_ai.output.messages` | any | Messages returned by the model where each message represents a specific model response (choice, candidate). [4] | [
  {
    "role": "assistant",
    "parts": [
      {
        "type": "text",
        "content": "The weather in Paris is currently rainy with a temperature of 57°F."
      }
    ],
    "finish_reason": "stop"
  }
] | ![Development](https://img.shields.io/badge/-development-blue) | -| `gen_ai.output.type` | string | Represents the content type requested by the client. [5] | `text`; `json`; `image` | ![Development](https://img.shields.io/badge/-development-blue) | -| `gen_ai.provider.name` | string | The Generative AI provider as identified by the client or server instrumentation. [6] | `openai`; `gcp.gen_ai`; `gcp.vertex_ai` | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.evaluation.explanation` | string | A free-form explanation for the assigned score provided by the evaluator. | `The response is factually accurate but lacks sufficient detail to fully address the question.` | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.evaluation.name` | string | The name of the evaluation metric used for the GenAI response. | `Relevance`; `IntentResolution` | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.evaluation.score.label` | string | Human readable label for evaluation. [2] | `relevant`; `not_relevant`; `correct`; `incorrect`; `pass`; `fail` | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.evaluation.score.value` | double | The evaluation score returned by the evaluator. | `4.0` | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.input.messages` | any | The chat history provided to the model as an input. [3] | [
  {
    "role": "user",
    "parts": [
      {
        "type": "text",
        "content": "Weather in Paris?"
      }
    ]
  },
  {
    "role": "assistant",
    "parts": [
      {
        "type": "tool_call",
        "id": "call_VSPygqKTWdrhaFErNvMV18Yl",
        "name": "get_weather",
        "arguments": {
          "location": "Paris"
        }
      }
    ]
  },
  {
    "role": "tool",
    "parts": [
      {
        "type": "tool_call_response",
        "id": "call_VSPygqKTWdrhaFErNvMV18Yl",
        "result": "rainy, 57°F"
      }
    ]
  }
] | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.operation.name` | string | The name of the operation being performed. [4] | `chat`; `generate_content`; `text_completion` | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.output.messages` | any | Messages returned by the model where each message represents a specific model response (choice, candidate). [5] | [
  {
    "role": "assistant",
    "parts": [
      {
        "type": "text",
        "content": "The weather in Paris is currently rainy with a temperature of 57°F."
      }
    ],
    "finish_reason": "stop"
  }
] | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.output.type` | string | Represents the content type requested by the client. [6] | `text`; `json`; `image` | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.provider.name` | string | The Generative AI provider as identified by the client or server instrumentation. [7] | `openai`; `gcp.gen_ai`; `gcp.vertex_ai` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.request.choice.count` | int | The target number of candidate completions to return. | `3` | ![Development](https://img.shields.io/badge/-development-blue) | -| `gen_ai.request.encoding_formats` | string[] | The encoding formats requested in an embeddings operation, if specified. [7] | `["base64"]`; `["float", "binary"]` | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.request.encoding_formats` | string[] | The encoding formats requested in an embeddings operation, if specified. [8] | `["base64"]`; `["float", "binary"]` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.request.frequency_penalty` | double | The frequency penalty setting for the GenAI request. | `0.1` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.request.max_tokens` | int | The maximum number of tokens the model generates for a request. | `100` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.request.model` | string | The name of the GenAI model a request is being made to. | `gpt-4` | ![Development](https://img.shields.io/badge/-development-blue) | @@ -38,18 +42,20 @@ This document defines the attributes used to describe telemetry in the context o | `gen_ai.response.finish_reasons` | string[] | Array of reasons the model stopped generating tokens, corresponding to each generation received. 
| `["stop"]`; `["stop", "length"]` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.response.id` | string | The unique identifier for the completion. | `chatcmpl-123` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.response.model` | string | The name of the model that generated the response. | `gpt-4-0613` | ![Development](https://img.shields.io/badge/-development-blue) | -| `gen_ai.system_instructions` | any | The system message or instructions provided to the GenAI model separately from the chat history. [8] | [
  {
    "type": "text",
    "content": "You are an Agent that greet users, always use greetings tool to respond"
  }
]; [
  {
    "type": "text",
    "content": "You are a language translator."
  },
  {
    "type": "text",
    "content": "Your mission is to translate text in English to French."
  }
] | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.system_instructions` | any | The system message or instructions provided to the GenAI model separately from the chat history. [9] | [
  {
    "type": "text",
    "content": "You are an Agent that greets users; always use the greetings tool to respond"
  }
]; [
  {
    "type": "text",
    "content": "You are a language translator."
  },
  {
    "type": "text",
    "content": "Your mission is to translate text in English to French."
  }
] | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.token.type` | string | The type of token being counted. | `input`; `output` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.tool.call.id` | string | The tool call identifier. | `call_mszuSIzqtI65i1wAUOE8w5H4` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.tool.description` | string | The tool description. | `Multiply two numbers` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.tool.name` | string | Name of the tool utilized by the agent. | `Flights` | ![Development](https://img.shields.io/badge/-development-blue) | -| `gen_ai.tool.type` | string | Type of the tool utilized by the agent [9] | `function`; `extension`; `datastore` | ![Development](https://img.shields.io/badge/-development-blue) | +| `gen_ai.tool.type` | string | Type of the tool utilized by the agent [10] | `function`; `extension`; `datastore` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.usage.input_tokens` | int | The number of tokens used in the GenAI input (prompt). | `100` | ![Development](https://img.shields.io/badge/-development-blue) | | `gen_ai.usage.output_tokens` | int | The number of tokens used in the GenAI response (completion). | `180` | ![Development](https://img.shields.io/badge/-development-blue) | **[1] `gen_ai.data_source.id`:** Data sources are used by AI agents and RAG applications to store grounding data. A data source may be an external database, object store, document collection, website, or any other storage system used by the GenAI agent or application. The `gen_ai.data_source.id` SHOULD match the identifier used by the GenAI system rather than a name specific to the external storage, such as a database or object store. Semantic conventions referencing `gen_ai.data_source.id` MAY also leverage additional attributes, such as `db.*`, to further identify and describe the data source. 
-**[2] `gen_ai.input.messages`:** Instrumentations MUST follow [Input messages JSON schema](/docs/gen-ai/gen-ai-input-messages.json). +**[2] `gen_ai.evaluation.score.label`:** This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator. For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator. The label SHOULD have low cardinality. Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values. + +**[3] `gen_ai.input.messages`:** Instrumentations MUST follow [Input messages JSON schema](/docs/gen-ai/gen-ai-input-messages.json). When the attribute is recorded on events, it MUST be recorded in structured form. When recorded on spans, it MAY be recorded as a JSON string if structured format is not supported and SHOULD be recorded in structured form otherwise. @@ -64,9 +70,9 @@ input messages. See [Recording content on attributes](/docs/gen-ai/gen-ai-spans.md#recording-content-on-attributes) section for more details. -**[3] `gen_ai.operation.name`:** If one of the predefined values applies, but specific system uses a different name it's RECOMMENDED to document it in the semantic conventions for specific GenAI system and use system-specific name in the instrumentation. If a different name is not documented, instrumentation libraries SHOULD use applicable predefined value. +**[4] `gen_ai.operation.name`:** If one of the predefined values applies, but specific system uses a different name it's RECOMMENDED to document it in the semantic conventions for specific GenAI system and use system-specific name in the instrumentation. If a different name is not documented, instrumentation libraries SHOULD use applicable predefined value. 
-**[4] `gen_ai.output.messages`:** Instrumentations MUST follow [Output messages JSON schema](/docs/gen-ai/gen-ai-output-messages.json) +**[5] `gen_ai.output.messages`:** Instrumentations MUST follow [Output messages JSON schema](/docs/gen-ai/gen-ai-output-messages.json) Each message represents a single output choice/candidate generated by the model. Each message corresponds to exactly one generation @@ -86,11 +92,11 @@ output messages. See [Recording content on attributes](/docs/gen-ai/gen-ai-spans.md#recording-content-on-attributes) section for more details. -**[5] `gen_ai.output.type`:** This attribute SHOULD be used when the client requests output of a specific type. The model may return zero or more outputs of this type. +**[6] `gen_ai.output.type`:** This attribute SHOULD be used when the client requests output of a specific type. The model may return zero or more outputs of this type. This attribute specifies the output modality and not the actual output format. For example, if an image is requested, the actual output could be a URL pointing to an image file. Additional output format details may be recorded in the future in the `gen_ai.output.{type}.*` attributes. -**[6] `gen_ai.provider.name`:** The attribute SHOULD be set based on the instrumentation's best +**[7] `gen_ai.provider.name`:** The attribute SHOULD be set based on the instrumentation's best knowledge and may differ from the actual model provider. Multiple providers, including Azure OpenAI, Gemini, and AI hosting platforms @@ -109,9 +115,9 @@ should have the `gen_ai.provider.name` set to `aws.bedrock` and include applicable `aws.bedrock.*` attributes and are not expected to include `openai.*` attributes. -**[7] `gen_ai.request.encoding_formats`:** In some GenAI systems the encoding formats are called embedding types. Also, some GenAI systems only accept a single format per request. +**[8] `gen_ai.request.encoding_formats`:** In some GenAI systems the encoding formats are called embedding types. 
Also, some GenAI systems only accept a single format per request. -**[8] `gen_ai.system_instructions`:** This attribute SHOULD be used when the corresponding provider or API +**[9] `gen_ai.system_instructions`:** This attribute SHOULD be used when the corresponding provider or API allows to provide system instructions or messages separately from the chat history. @@ -132,7 +138,7 @@ system instructions. See [Recording content on attributes](/docs/gen-ai/gen-ai-spans.md#recording-content-on-attributes) section for more details. -**[9] `gen_ai.tool.type`:** Extension: A tool executed on the agent-side to directly call external APIs, bridging the gap between the agent and real-world systems. +**[10] `gen_ai.tool.type`:** Extension: A tool executed on the agent-side to directly call external APIs, bridging the gap between the agent and real-world systems. Agent-side operations involve actions that are performed by the agent on the server or within the agent's controlled environment. Function: A tool executed on the client-side, where the agent generates parameters for a predefined function, and the client executes the logic. Client-side operations are actions taken on the user's end or within the client application. 
@@ -175,9 +181,9 @@ Datastore: A tool used by the agent to access and query structured or unstructur | `azure.ai.openai` | [Azure OpenAI](https://azure.microsoft.com/products/ai-services/openai-service/) | ![Development](https://img.shields.io/badge/-development-blue) | | `cohere` | [Cohere](https://cohere.com/) | ![Development](https://img.shields.io/badge/-development-blue) | | `deepseek` | [DeepSeek](https://www.deepseek.com/) | ![Development](https://img.shields.io/badge/-development-blue) | -| `gcp.gemini` | [Gemini](https://cloud.google.com/products/gemini) [10] | ![Development](https://img.shields.io/badge/-development-blue) | -| `gcp.gen_ai` | Any Google generative AI endpoint [11] | ![Development](https://img.shields.io/badge/-development-blue) | -| `gcp.vertex_ai` | [Vertex AI](https://cloud.google.com/vertex-ai) [12] | ![Development](https://img.shields.io/badge/-development-blue) | +| `gcp.gemini` | [Gemini](https://cloud.google.com/products/gemini) [11] | ![Development](https://img.shields.io/badge/-development-blue) | +| `gcp.gen_ai` | Any Google generative AI endpoint [12] | ![Development](https://img.shields.io/badge/-development-blue) | +| `gcp.vertex_ai` | [Vertex AI](https://cloud.google.com/vertex-ai) [13] | ![Development](https://img.shields.io/badge/-development-blue) | | `groq` | [Groq](https://groq.com/) | ![Development](https://img.shields.io/badge/-development-blue) | | `ibm.watsonx.ai` | [IBM Watsonx AI](https://www.ibm.com/products/watsonx-ai) | ![Development](https://img.shields.io/badge/-development-blue) | | `mistral_ai` | [Mistral AI](https://mistral.ai/) | ![Development](https://img.shields.io/badge/-development-blue) | @@ -185,11 +191,11 @@ Datastore: A tool used by the agent to access and query structured or unstructur | `perplexity` | [Perplexity](https://www.perplexity.ai/) | ![Development](https://img.shields.io/badge/-development-blue) | | `x_ai` | [xAI](https://x.ai/) | 
![Development](https://img.shields.io/badge/-development-blue) | -**[10]:** Used when accessing the 'generativelanguage.googleapis.com' endpoint. Also known as the AI Studio API. +**[11]:** Used when accessing the 'generativelanguage.googleapis.com' endpoint. Also known as the AI Studio API. -**[11]:** May be used when specific backend is unknown. +**[12]:** May be used when specific backend is unknown. -**[12]:** Used when accessing the 'aiplatform.googleapis.com' endpoint. +**[13]:** Used when accessing the 'aiplatform.googleapis.com' endpoint. --- @@ -226,20 +232,20 @@ Describes deprecated `gen_ai` attributes. | `azure.ai.openai` | Azure OpenAI | ![Development](https://img.shields.io/badge/-development-blue) | | `cohere` | Cohere | ![Development](https://img.shields.io/badge/-development-blue) | | `deepseek` | DeepSeek | ![Development](https://img.shields.io/badge/-development-blue) | -| `gcp.gemini` | Gemini [13] | ![Development](https://img.shields.io/badge/-development-blue) | -| `gcp.gen_ai` | Any Google generative AI endpoint [14] | ![Development](https://img.shields.io/badge/-development-blue) | -| `gcp.vertex_ai` | Vertex AI [15] | ![Development](https://img.shields.io/badge/-development-blue) | +| `gcp.gemini` | Gemini [14] | ![Development](https://img.shields.io/badge/-development-blue) | +| `gcp.gen_ai` | Any Google generative AI endpoint [15] | ![Development](https://img.shields.io/badge/-development-blue) | +| `gcp.vertex_ai` | Vertex AI [16] | ![Development](https://img.shields.io/badge/-development-blue) | | `groq` | Groq | ![Development](https://img.shields.io/badge/-development-blue) | | `ibm.watsonx.ai` | IBM Watsonx AI | ![Development](https://img.shields.io/badge/-development-blue) | | `mistral_ai` | Mistral AI | ![Development](https://img.shields.io/badge/-development-blue) | | `openai` | OpenAI | ![Development](https://img.shields.io/badge/-development-blue) | | `perplexity` | Perplexity | 
![Development](https://img.shields.io/badge/-development-blue) | -**[13]:** This refers to the 'generativelanguage.googleapis.com' endpoint. Also known as the AI Studio API. May use common attributes prefixed with 'gcp.gen_ai.'. +**[14]:** This refers to the 'generativelanguage.googleapis.com' endpoint. Also known as the AI Studio API. May use common attributes prefixed with 'gcp.gen_ai.'. -**[14]:** May be used when specific backend is unknown. May use common attributes prefixed with 'gcp.gen_ai.'. +**[15]:** May be used when specific backend is unknown. May use common attributes prefixed with 'gcp.gen_ai.'. -**[15]:** This refers to the 'aiplatform.googleapis.com' endpoint. May use common attributes prefixed with 'gcp.gen_ai.'. +**[16]:** This refers to the 'aiplatform.googleapis.com' endpoint. May use common attributes prefixed with 'gcp.gen_ai.'. ## Deprecated OpenAI GenAI Attributes diff --git a/model/gen-ai/events.yaml b/model/gen-ai/events.yaml index bb4cde64cc..40b4051640 100644 --- a/model/gen-ai/events.yaml +++ b/model/gen-ai/events.yaml @@ -10,3 +10,37 @@ groups: This event is opt-in and could be used to store input and output details independently from traces. extends: attributes.gen_ai.inference.client + + - id: event.gen_ai.evaluation.result + name: gen_ai.evaluation.result + type: event + stability: development + brief: > + This event captures the result of evaluating GenAI output for quality, accuracy, or other characteristics. + This event SHOULD be parented to GenAI operation span being evaluated when possible + or set `gen_ai.response.id` when span id is not available. 
+ attributes: + - ref: gen_ai.evaluation.name + requirement_level: required + - ref: gen_ai.evaluation.score.value + requirement_level: + conditionally_required: if applicable + - ref: gen_ai.evaluation.score.label + requirement_level: + conditionally_required: if applicable + - ref: gen_ai.evaluation.explanation + requirement_level: recommended + - ref: gen_ai.response.id + requirement_level: + recommended: when available + note: | + The unique identifier assigned to the specific + completion being evaluated. This attribute helps correlate the evaluation + event with the corresponding operation when span id is not available. + - ref: error.type + requirement_level: + conditionally_required: "if the operation ended in an error" + note: | + The `error.type` SHOULD match the error code returned by the Generative AI Evaluation provider or the client library, + the canonical name of exception that occurred, or another low-cardinality error identifier. + Instrumentations SHOULD document the list of errors they report. diff --git a/model/gen-ai/registry.yaml b/model/gen-ai/registry.yaml index 4d6fd74e2e..83b773272a 100644 --- a/model/gen-ai/registry.yaml +++ b/model/gen-ai/registry.yaml @@ -479,3 +479,28 @@ groups: "finish_reason": "stop" } ] + - id: gen_ai.evaluation.name + stability: development + type: string + brief: The name of the evaluation metric used for the GenAI response. + examples: ["Relevance", "IntentResolution"] + - id: gen_ai.evaluation.score.value + stability: development + type: double + brief: The evaluation score returned by the evaluator. + examples: [4.0] + - id: gen_ai.evaluation.score.label + stability: development + type: string + brief: Human readable label for evaluation. + examples: ["relevant", "not_relevant", "correct", "incorrect", "pass", "fail"] + note: > + This attribute provides a human-readable interpretation of the evaluation score produced by an evaluator. 
+ For example, a score value of 1 could mean "relevant" in one evaluation system and "not relevant" in another, depending on the scoring range and evaluator. + The label SHOULD have low cardinality. + Possible values depend on the evaluation metric and evaluator used; implementations SHOULD document the possible values. + - id: gen_ai.evaluation.explanation + stability: development + type: string + brief: A free-form explanation for the assigned score provided by the evaluator. + examples: ["The response is factually accurate but lacks sufficient detail to fully address the question."]
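
For reviewers, a minimal sketch of how an instrumentation might assemble the attributes for a `gen_ai.evaluation.result` event, following the requirement levels in the table above. The helper name `build_evaluation_result_attributes` and its parameters are hypothetical and are not part of this change; the sketch deliberately avoids any OpenTelemetry SDK call and only builds a plain dictionary.

```python
from typing import Dict, Optional, Union

# Event name mandated by the convention in this diff.
EVENT_NAME = "gen_ai.evaluation.result"

AttributeValue = Union[str, float]


def build_evaluation_result_attributes(
    name: str,
    score_value: Optional[float] = None,
    score_label: Optional[str] = None,
    explanation: Optional[str] = None,
    response_id: Optional[str] = None,
    error_type: Optional[str] = None,
) -> Dict[str, AttributeValue]:
    """Hypothetical helper: collect attributes for one evaluation result."""
    # gen_ai.evaluation.name is the only Required attribute.
    if not name:
        raise ValueError("gen_ai.evaluation.name is required")

    attributes: Dict[str, AttributeValue] = {"gen_ai.evaluation.name": name}

    # Conditionally Required / Recommended attributes are set only when known.
    optional: Dict[str, Optional[AttributeValue]] = {
        # The registry types the score as double, so coerce ints to float.
        "gen_ai.evaluation.score.value": (
            float(score_value) if score_value is not None else None
        ),
        "gen_ai.evaluation.score.label": score_label,
        "gen_ai.evaluation.explanation": explanation,
        # Used for correlation when the evaluated span id is not available.
        "gen_ai.response.id": response_id,
        # Only set when the evaluation itself ended in an error.
        "error.type": error_type,
    }
    for key, value in optional.items():
        if value is not None:
            attributes[key] = value
    return attributes


attrs = build_evaluation_result_attributes(
    "Relevance",
    score_value=4,
    score_label="relevant",
    response_id="chatcmpl-123",
)
```

The resulting dictionary would then be handed, together with `EVENT_NAME`, to whichever event or log API the instrumentation uses. When the evaluated operation's span is available, the event would be emitted in that span's context and `gen_ai.response.id` becomes optional.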