From 9673b3b27c9500f42af2d34318fae06626ef5b7a Mon Sep 17 00:00:00 2001 From: Jennifer Mickel Date: Tue, 7 Oct 2025 11:15:47 -0400 Subject: [PATCH 1/5] just moved managed_evaluations into its on folder --- .../{managed_evaluations.md => managed_evaluations/_index.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename content/en/llm_observability/evaluations/{managed_evaluations.md => managed_evaluations/_index.md} (100%) diff --git a/content/en/llm_observability/evaluations/managed_evaluations.md b/content/en/llm_observability/evaluations/managed_evaluations/_index.md similarity index 100% rename from content/en/llm_observability/evaluations/managed_evaluations.md rename to content/en/llm_observability/evaluations/managed_evaluations/_index.md From 3fc77b03dd7220f45f5627352a29c9efcfcc4ffd Mon Sep 17 00:00:00 2001 From: Jennifer Mickel Date: Tue, 7 Oct 2025 15:26:29 -0400 Subject: [PATCH 2/5] added intial version of the agent_evals.md file and modified the index and other things --- config/_default/menus/main.en.yaml | 5 + .../evaluations/managed_evaluations/_index.md | 204 -------------- .../managed_evaluations/agent_evals.md | 265 ++++++++++++++++++ 3 files changed, 270 insertions(+), 204 deletions(-) create mode 100644 content/en/llm_observability/evaluations/managed_evaluations/agent_evals.md diff --git a/config/_default/menus/main.en.yaml b/config/_default/menus/main.en.yaml index 25ae18f61dd17..e59a834758618 100644 --- a/config/_default/menus/main.en.yaml +++ b/config/_default/menus/main.en.yaml @@ -4785,6 +4785,11 @@ menu: parent: llm_obs_evaluations identifier: llm_obs_managed_evaluations weight: 401 + - name: Agent + url: llm_observability/evaluations/managed_evaluations/agent_evals + parent: llm_obs_evaluations + identifier: llm_obs_managed_evaluations_agent + weight: 501 - name: Ragas url: llm_observability/evaluations/ragas_evaluations parent: llm_obs_evaluations diff --git a/content/en/llm_observability/evaluations/managed_evaluations/_index.md b/content/en/llm_observability/evaluations/managed_evaluations/_index.md index 6b43067a0ca42..0a7634c48336f 100644 --- a/content/en/llm_observability/evaluations/managed_evaluations/_index.md +++ b/content/en/llm_observability/evaluations/managed_evaluations/_index.md @@ -259,210 +259,6 @@ This check helps understand the overall mood of the conversation, gauge user sat |---|---|---| | Evaluated on Input and Output | Evaluated using LLM | Sentiment flags the emotional tone or attitude expressed in the text, categorizing it as positive, negative, or neutral. | -#### Goal completeness - -This check evaluates whether your LLM chatbot can successfully carry out a full session by effectively meeting the user's needs from start to finish. This completeness measure serves as a proxy for gauging user satisfaction over the course of a multi-turn interaction and is especially valuable for LLM chatbot applications. - -{{< img src="llm_observability/evaluations/goal_completeness.png" alt="A Goal Completeness evaluation detected by an LLM in LLM Observability" style="width:100%;" >}} - -| Evaluation Stage | Evaluation Method | Evaluation Definition | -|---|---|---| -| Evaluated on session | Evaluated using LLM | Goal Completeness assesses whether all user intentions within a multi-turn interaction were successfully resolved. The evaluation identifies resolved and unresolved intentions, providing a completeness score based on the ratio of unresolved to total intentions. 
| - -For optimal evaluation accuracy and cost control, it is preferable to send a tag when the session is finished and configure the evaluation to run only on session with this tag. The evaluation returns a detailed breakdown including resolved intentions, unresolved intentions, and reasoning for the assessment. A session is considered incomplete if more than 50% of identified intentions remain unresolved. - -##### Instrumentation - -To enable Goal Completeness evaluation, you need to instrument your application to track sessions and their completion status. This evaluation works by analyzing complete sessions to determine if all user intentions were successfully addressed. - -The evaluation requires sending a span with a specific tag when the session ends. This signal allows the evaluation to identify session boundaries and trigger the completeness assessment: - -{{< code-block lang="python" >}} -from ddtrace.llmobs import LLMObs -from ddtrace.llmobs.decorators import llm - -# Call this function whenever your session has ended -@llm(model_name="model_name", model_provider="model_provider") -def send_session_ended_span(input_data, output_data) -> None: - """Send a span to indicate the chat session has ended.""" - LLMObs.annotate( - input_data=input_data, - output_data=output_data, - tags={"session_status": "completed"} - ) -{{< /code-block >}} - -Replace `session_status` and `completed` with your preferred tag key and value. - -The span should contain meaningful `input_data` and `output_data` that represent the final state of the session. This helps the evaluation understand the session's context and outcomes when assessing completeness. - -##### Goal completeness configuration - -After instrumenting your application to send session-end spans, configure the evaluation to run only on sessions with your specific tag. This targeted approach ensures the evaluation analyzes complete sessions rather than partial interactions. - -1. Go to the **Goal Completeness** settings -2. Configure the evaluation data: - - Select **spans** as the data type since Goal Completeness runs on LLM spans which contains the full session history. - - Choose the tag name associated with the span that corresponds to your session-end function (for example, `send_session_ended_span`). - - In the **tags** section, specify the tag you configured in your instrumentation (for example, `session_status:completed`). - -This configuration ensures evaluations run only on complete sessions. This provides accurate assessments of user intention resolution. - -#### Tool selection - -This check evaluates whether the agent has successfully selected the appropriate tools to address the user's request. - -{{< img src="llm_observability/evaluations/tool_selection_failure.png" alt="A tool selection failure detected by the evaluation in LLM Observability" style="width:100%;" >}} - -| Evaluation Stage | Evaluation Method | Evaluation Definition | -|---|---|---| -| Evaluated on LLM spans| Evaluated using LLM | Tool Selection verifies that the tools chosen by the LLM align with the user's request and the available tools. The evaluation identifies cases where irrelevant or incorrect tool calls were made.| - -##### Instrumentation - -This evaluation is supported in dd-trace version 3.12 and above. 
The example below uses the OpenAI Agents SDK to illustrate how tools are made available to the agent and to the evaluation: - -{{< code-block lang="python" >}} -from ddtrace.llmobs import LLMObs -from agents import Agent, ModelSettings, function_tool - -@function_tool -def add_numbers(a: int, b: int) -> int: - """ - Adds two numbers together. - """ - return a + b - -@function_tool -def subtract_numbers(a: int, b: int) -> int: - """ - Subtracts two numbers. - """ - return a - b - - -# List of tools available to the agent -math_tutor_agent = Agent( - name="Math Tutor", - handoff_description="Specialist agent for math questions", - instructions="You provide help with math problems. Please use the tools to find the answer.", - model="o3-mini", - tools=[ - add_numbers, subtract_numbers - ], -) - -history_tutor_agent = Agent( - name="History Tutor", - handoff_description="Specialist agent for history questions", - instructions="You provide help with history problems.", - model="o3-mini", -) - -# The triage agent decides which specialized agent to hand off the task to — another type of tool selection covered by this evaluation. -triage_agent = Agent( - 'openai:gpt-4o', - model_settings=ModelSettings(temperature=0), - instructions='What is the sum of 1 to 10?', - handoffs=[math_tutor_agent, history_tutor_agent], -) -{{< /code-block >}} - -#### Tool argument correctness - -This check looks at the arguments provided to a selected tool, and it evaluates whether these arguments match the expected type and make sense given the tool's context. - -{{< img src="llm_observability/evaluations/tool_argument_correctness_error.png" alt="A tool argument correctness error detected by the evaluation in LLM Observability" style="width:100%;" >}} - -| Evaluation Stage | Evaluation Method | Evaluation Definition | -|---|---|---| -| Evaluated on LLM spans| Evaluated using LLM | Tool Argument Correctness verifies that the arguments provided to a tool by the LLM are correct and contextually relevant. This evaluation identifies cases where the arguments provided to the tool are incorrect according to the tool schema (for example: the argument is expected to be an integer rather than a string) and are not relevant (for example: the argument is a country, but the model provides the name of a city).| - -##### Instrumentation - -This evaluation is supported in `dd-trace` v3.12+. The example below uses the OpenAI Agents SDK to illustrate how tools are made available to the agent and to the evaluation: - -{{< code-block lang="python" >}} -import os - -from ddtrace.llmobs import LLMObs -from pydantic_ai import Agent - - -# Define tools as regular functions with type hints -def add_numbers(a: int, b: int) -> int: - """ - Adds two numbers together. - """ - return a + b - - -def subtract_numbers(a: int, b: int) -> int: - """ - Subtracts two numbers. - """ - return a - b - - -def multiply_numbers(a: int, b: int) -> int: - """ - Multiplies two numbers. - """ - return a * b - - -def divide_numbers(a: int, b: int) -> float: - """ - Divides two numbers. - """ - return a / b - - -# Enable LLMObs -LLMObs.enable( - ml_app="jenn_test", - api_key=os.environ["DD_API_KEY"], - site=os.environ["DD_SITE"], - agentless_enabled=True, -) - - -# Create the Math Tutor agent with tools -math_tutor_agent = Agent( - 'openai:gpt-5-nano', - instructions="You provide help with math problems. 
Please use the tools to find the answer.",
    tools=[add_numbers, subtract_numbers, multiply_numbers, divide_numbers],
)

# Create the History Tutor agent (note: gpt-5-nano doesn't exist, using gpt-4o-mini)
history_tutor_agent = Agent(
    'openai:gpt-5-nano',
    instructions="You provide help with history problems.",
)

# Create the triage agent
# Note: pydantic_ai handles handoffs differently - you'd typically use result_type
# or custom logic to route between agents
triage_agent = Agent(
    'openai:gpt-5-nano',
    instructions=(
        'DO NOT RELY ON YOUR OWN MATHEMATICAL KNOWLEDGE, '
        'MAKE SURE TO CALL AVAILABLE TOOLS TO SOLVE EVERY SUBPROBLEM.'
    ),
    tools=[add_numbers, subtract_numbers, multiply_numbers, divide_numbers],
)


# Run the agent synchronously
result = triage_agent.run_sync(
    '''
    Help me solve the following problem:
    What is the sum of the numbers between 1 and 100?
    Make sure you list out all the mathematical operations (addition, subtraction, multiplication, division) in order before you start calling tools in that order.
    '''
)
{{< /code-block >}}

### Security and Safety evaluations

#### Toxicity
diff --git a/content/en/llm_observability/evaluations/managed_evaluations/agent_evals.md b/content/en/llm_observability/evaluations/managed_evaluations/agent_evals.md
new file mode 100644
index 0000000000000..b1e0db8f6da0a
--- /dev/null
+++ b/content/en/llm_observability/evaluations/managed_evaluations/agent_evals.md
@@ -0,0 +1,265 @@
---
title: Agent Evaluations
description: Learn how to configure agent evaluations for your LLM applications.
further_reading:
- link: "/llm_observability/terms/"
  tag: "Documentation"
  text: "Learn about LLM Observability terms and concepts"
- link: "/llm_observability/setup"
  tag: "Documentation"
  text: "Learn how to set up LLM Observability"
- link: "https://www.datadoghq.com/blog/llm-observability-hallucination-detection/"
  tag: "Blog"
  text: "Detect hallucinations in your RAG LLM applications with Datadog LLM Observability"
aliases:
    - /llm_observability/evaluations/ootb_evaluations
---

Agent evaluations help ensure that your LLM-powered applications make the right tool calls and resolve user requests successfully. These checks are designed to catch common failure modes when agents interact with external tools, APIs, or workflows.

## Tool selection

This evaluation checks whether the agent selected the appropriate tools to address the user's request. Incorrect or irrelevant tool choices lead to wasted calls, higher latency, and failed tasks.

### Evaluation summary

| **Span kind** | **Method** | **Definition** |
|---|---|---|
| Evaluated on **LLM spans** | Evaluated using LLM | Verifies that the tools chosen by the LLM align with the user's request and the set of available tools. Flags irrelevant or incorrect tool calls. |

### Example

{{< img src="llm_observability/evaluations/tool_selection_failure.png" alt="A tool selection failure detected by the evaluation in LLM Observability" style="width:100%;" >}}

### How to use

1. Make sure you are running `dd-trace` v3.12 or later and that LLM Observability is enabled in your application (a minimal setup sketch follows this list).
2. Instrument your agent so that the tools available to it are captured on LLM spans, as shown in the example below.
3. Enable the **Tool Selection** evaluation in the Datadog UI.
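If LLM Observability is not already enabled in your application, a minimal agentless setup might look like the following. The `ml_app` value `my-agent-app` is a placeholder; use your own application name. This mirrors the `LLMObs.enable` call used in the Tool Argument Correctness example later on this page.

{{< code-block lang="python" >}}
import os

from ddtrace.llmobs import LLMObs

# Enable LLM Observability before defining and running your agents.
# "my-agent-app" is a placeholder application name.
LLMObs.enable(
    ml_app="my-agent-app",
    api_key=os.environ["DD_API_KEY"],
    site=os.environ["DD_SITE"],
    agentless_enabled=True,
)
{{< /code-block >}}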
The example below uses the OpenAI Agents SDK to illustrate how tools are made available to the agent and to the evaluation:

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs
from agents import Agent, ModelSettings, function_tool

@function_tool
def add_numbers(a: int, b: int) -> int:
    """
    Adds two numbers together.
    """
    return a + b

@function_tool
def subtract_numbers(a: int, b: int) -> int:
    """
    Subtracts two numbers.
    """
    return a - b


# List of tools available to the agent
math_tutor_agent = Agent(
    name="Math Tutor",
    handoff_description="Specialist agent for math questions",
    instructions="You provide help with math problems. Please use the tools to find the answer.",
    model="o3-mini",
    tools=[
        add_numbers, subtract_numbers
    ],
)

history_tutor_agent = Agent(
    name="History Tutor",
    handoff_description="Specialist agent for history questions",
    instructions="You provide help with history problems.",
    model="o3-mini",
)

# The triage agent decides which specialized agent to hand the task off to.
# Handoffs are another type of tool selection covered by this evaluation.
triage_agent = Agent(
    name="Triage Agent",
    instructions="Route the user's question to the appropriate specialist agent.",
    model="o3-mini",
    model_settings=ModelSettings(temperature=0),
    handoffs=[math_tutor_agent, history_tutor_agent],
)
{{< /code-block >}}

### Troubleshooting

- If you frequently see irrelevant tool calls, review your tool descriptions. They may be too vague for the LLM to distinguish between tools.
- Make sure every tool has a description: the docstring directly under the function signature is automatically parsed by the SDK as the tool description.

## Tool argument correctness

Even if the right tool is selected, the arguments passed to it must be valid and contextually relevant. Incorrect argument formats (for example, a string instead of an integer) or irrelevant values cause failures in downstream execution.

### Evaluation summary

| **Span kind** | **Method** | **Definition** |
|---|---|---|
| Evaluated on **LLM spans** | Evaluated using LLM | Verifies that arguments provided to a tool are correct and relevant based on the tool schema. Identifies invalid or irrelevant arguments. |

### Example

{{< img src="llm_observability/evaluations/tool_argument_correctness_error.png" alt="A tool argument correctness error detected by the evaluation in LLM Observability" style="width:100%;" >}}

### How to use

1. Make sure you are running `dd-trace` v3.12 or later.
2. Instrument your agent so that tool definitions, including argument types and descriptions, are captured on LLM spans, as shown in the example below.
3. Enable the **Tool Argument Correctness** evaluation in the Datadog UI.

The example below uses Pydantic AI to illustrate how tools and their argument schemas are made available to the agent and to the evaluation:

{{< code-block lang="python" >}}
import os

from ddtrace.llmobs import LLMObs
from pydantic_ai import Agent


# Define tools as regular functions with type hints
def add_numbers(a: int, b: int) -> int:
    """
    Adds two numbers together.
    """
    return a + b


def subtract_numbers(a: int, b: int) -> int:
    """
    Subtracts two numbers.
    """
    return a - b


def multiply_numbers(a: int, b: int) -> int:
    """
    Multiplies two numbers.
    """
    return a * b


def divide_numbers(a: int, b: int) -> float:
    """
    Divides two numbers.
    """
    return a / b


# Enable LLMObs
LLMObs.enable(
    ml_app="my-agent-app",
    api_key=os.environ["DD_API_KEY"],
    site=os.environ["DD_SITE"],
    agentless_enabled=True,
)


# Create the Math Tutor agent with tools
math_tutor_agent = Agent(
    'openai:gpt-5-nano',
    instructions="You provide help with math problems. Please use the tools to find the answer.",
    tools=[add_numbers, subtract_numbers, multiply_numbers, divide_numbers],
)

# Create the History Tutor agent
history_tutor_agent = Agent(
    'openai:gpt-5-nano',
    instructions="You provide help with history problems.",
)

# Create the triage agent
# Note: Pydantic AI handles handoffs differently. You would typically use output types
# or custom logic to route between agents.
triage_agent = Agent(
    'openai:gpt-5-nano',
    instructions=(
        'DO NOT RELY ON YOUR OWN MATHEMATICAL KNOWLEDGE, '
        'MAKE SURE TO CALL AVAILABLE TOOLS TO SOLVE EVERY SUBPROBLEM.'
    ),
    tools=[add_numbers, subtract_numbers, multiply_numbers, divide_numbers],
)


# Run the agent synchronously
result = triage_agent.run_sync(
    '''
    Help me solve the following problem:
    What is the sum of the numbers between 1 and 100?
    Make sure you list out all the mathematical operations (addition, subtraction, multiplication, division) in order before you start calling tools in that order.
    '''
)
{{< /code-block >}}

### Troubleshooting

- Make sure your tools use type hints. The evaluation relies on the tool's schema definition.
- Make sure to include a tool description (the docstring directly under the function signature); auto-instrumentation uses it to parse the tool's schema.
- Validate that your LLM prompt includes enough context for correct argument construction.

## Goal completeness

An agent can call tools correctly but still fail to achieve the user's intended goal. This evaluation checks whether your LLM chatbot can successfully carry out a full session by effectively meeting the user's needs from start to finish. This completeness measure serves as a proxy for gauging user satisfaction over the course of a multi-turn interaction and is especially valuable for LLM chatbot applications.

### Evaluation summary

| **Span kind** | **Method** | **Definition** |
|---|---|---|
| Evaluated on LLM spans | Evaluated using LLM | Checks whether the agent resolved the user's intentions by analyzing the full session. Runs only on sessions marked as completed. |

### Example

{{< img src="llm_observability/evaluations/goal_completeness.png" alt="A Goal Completeness evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}

### How to use

To enable the Goal Completeness evaluation, instrument your application to track sessions and their completion status. The evaluation analyzes complete sessions to determine whether all user intentions were successfully addressed.

The evaluation requires sending a span with a specific tag when the session ends. This signal allows the evaluation to identify session boundaries and trigger the completeness assessment.

For optimal evaluation accuracy and cost control, send this tag when the session is finished and configure the evaluation to run only on sessions with this tag. The evaluation returns a detailed breakdown of resolved intentions, unresolved intentions, and the reasoning for the assessment.
A session is considered incomplete if more than 50% of identified intentions remain unresolved.

For example, the following function sends a session-end span tagged with `session_status:completed`:

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import llm

# Call this function whenever your session has ended
@llm(model_name="model_name", model_provider="model_provider")
def send_session_ended_span(input_data, output_data) -> None:
    """Send a span to indicate the chat session has ended."""
    LLMObs.annotate(
        input_data=input_data,
        output_data=output_data,
        tags={"session_status": "completed"}
    )
{{< /code-block >}}

Replace `session_status` and `completed` with your preferred tag key and value.

The span should contain meaningful `input_data` and `output_data` that represent the final state of the session. This helps the evaluation understand the session's context and outcomes when assessing completeness.

### Goal completeness configuration

After instrumenting your application to send session-end spans, configure the evaluation to run only on sessions with your specific tag. This targeted approach ensures the evaluation analyzes complete sessions rather than partial interactions.

1. Go to the **Goal Completeness** evaluation settings.
2. Configure the evaluation data:
   - Select **spans** as the data type, since Goal Completeness runs on LLM spans, which contain the full session history.
   - Choose the name of the span that corresponds to your session-end function (for example, `send_session_ended_span`).
   - In the **tags** section, specify the tag you configured in your instrumentation (for example, `session_status:completed`).

This configuration ensures evaluations run only on complete sessions, which provides accurate assessments of user intention resolution.

### Troubleshooting

- If evaluations are skipped, check that you are tagging session-end spans correctly.
- Ensure your agent is configured to signal the end of a user request cycle.


From 1c765651da5ec416f9b39425af4f3e6cfffab29b Mon Sep 17 00:00:00 2001
From: Jennifer Mickel
Date: Tue, 7 Oct 2025 16:21:27 -0400
Subject: [PATCH 3/5] fix something

---
 .../evaluations/managed_evaluations/agent_evals.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/content/en/llm_observability/evaluations/managed_evaluations/agent_evals.md b/content/en/llm_observability/evaluations/managed_evaluations/agent_evals.md
index b1e0db8f6da0a..129f6992539ab 100644
--- a/content/en/llm_observability/evaluations/managed_evaluations/agent_evals.md
+++ b/content/en/llm_observability/evaluations/managed_evaluations/agent_evals.md
@@ -210,9 +210,7 @@ An agent can call tools correctly but still fail to achieve the user's intende
 ### Evaluation summary
 | **Span kind** | **Method** | **Definition** |
 |---|---|---|
-| Evaluated on LLLM spans | Evaluated using LLM | Checks whether the agent resolved the user's intent by analyzing full session spans. Runs only on sessions marked as completed.
| ### Example {{< img src="llm_observability/evaluations/goal_completeness.png" alt="A Goal Completeness evaluation detected by an LLM in LLM Observability" style="width:100%;" >}} From 14c89ecf85043e53a0989c38db4075c321a44353 Mon Sep 17 00:00:00 2001 From: Jennifer Mickel Date: Wed, 8 Oct 2025 09:48:58 -0400 Subject: [PATCH 4/5] changing the weight --- config/_default/menus/main.en.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/config/_default/menus/main.en.yaml b/config/_default/menus/main.en.yaml index e59a834758618..19f6b6e07185d 100644 --- a/config/_default/menus/main.en.yaml +++ b/config/_default/menus/main.en.yaml @@ -4789,7 +4789,7 @@ menu: url: llm_observability/evaluations/managed_evaluations/agent_evals parent: llm_obs_evaluations identifier: llm_obs_managed_evaluations_agent - weight: 501 + weight: 40101 - name: Ragas url: llm_observability/evaluations/ragas_evaluations parent: llm_obs_evaluations From 969f7c5447fbb441a7f1f2d6c633bd6238e7eb23 Mon Sep 17 00:00:00 2001 From: Jennifer Mickel Date: Wed, 8 Oct 2025 09:51:02 -0400 Subject: [PATCH 5/5] updated the parent --- config/_default/menus/main.en.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/config/_default/menus/main.en.yaml b/config/_default/menus/main.en.yaml index 19f6b6e07185d..6d6f98d3b39bf 100644 --- a/config/_default/menus/main.en.yaml +++ b/config/_default/menus/main.en.yaml @@ -4787,7 +4787,7 @@ menu: weight: 401 - name: Agent url: llm_observability/evaluations/managed_evaluations/agent_evals - parent: llm_obs_evaluations + parent: llm_obs_managed_evaluations identifier: llm_obs_managed_evaluations_agent weight: 40101 - name: Ragas