7 changes: 6 additions & 1 deletion config/_default/menus/main.en.yaml
@@ -344,7 +344,7 @@ menu:
url: agent/supported_platforms/heroku/
weight: 405
parent: agent_supported_platforms
- name: MacOS
- name: MacOS
identifier: basic_agent_usage_osx
url: agent/supported_platforms/osx/
weight: 406
@@ -4780,6 +4780,11 @@ menu:
parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
url: llm_observability/evaluations/custom_llm_as_a_judge_evaluations
parent: llm_obs_evaluations
identifier: llm_obs_custom_llm_as_a_judge_evaluations
weight: 400
- name: Managed
url: llm_observability/evaluations/managed_evaluations
parent: llm_obs_evaluations
5 changes: 5 additions & 0 deletions config/_default/menus/main.es.yaml
@@ -4770,6 +4770,11 @@ menu:
parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor: you don't need to change these, just the english one

Contributor Author: done
url: llm_observability/evaluations/custom_llm_as_a_judge_evaluations
parent: llm_obs_evaluations
identifier: llm_obs_custom_llm_as_a_judge_evaluations
weight: 400
- name: Managed
url: llm_observability/evaluations/managed_evaluations
parent: llm_obs_evaluations
5 changes: 5 additions & 0 deletions config/_default/menus/main.fr.yaml
@@ -4770,6 +4770,11 @@ menu:
parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor: or this

Contributor Author: done
url: llm_observability/evaluations/custom_llm_as_a_judge_evaluations
parent: llm_obs_evaluations
identifier: llm_obs_custom_llm_as_a_judge_evaluations
weight: 400
- name: Managed
url: llm_observability/evaluations/managed_evaluations
parent: llm_obs_evaluations
5 changes: 5 additions & 0 deletions config/_default/menus/main.ja.yaml
@@ -4770,6 +4770,11 @@ menu:
parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor: or this

Contributor Author: done
url: llm_observability/evaluations/custom_llm_as_a_judge_evaluations
parent: llm_obs_evaluations
identifier: llm_obs_custom_llm_as_a_judge_evaluations
weight: 400
- name: Managed
url: llm_observability/evaluations/managed_evaluations
parent: llm_obs_evaluations
5 changes: 5 additions & 0 deletions config/_default/menus/main.ko.yaml
@@ -4770,6 +4770,11 @@ menu:
parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor: or this

Contributor Author: done
url: llm_observability/evaluations/custom_llm_as_a_judge_evaluations
parent: llm_obs_evaluations
identifier: llm_obs_custom_llm_as_a_judge_evaluations
weight: 400
- name: Managed
url: llm_observability/evaluations/managed_evaluations
parent: llm_obs_evaluations
31 changes: 18 additions & 13 deletions content/en/llm_observability/evaluations/_index.md
@@ -8,34 +8,39 @@ aliases:

## Overview

LLM Observability offers several ways to support evaluations. They can be configured by navigating to [**AI Observability > Settings > Evaluations**][7].
LLM Observability offers several ways to support evaluations. They can be configured by navigating to [**AI Observability > Settings > Evaluations**][8].

### Custom LLM-as-a-Judge Evaluations

[Custom LLM-as-a-Judge Evaluations][1] allow you to define your own evaluation logic using natural language prompts. You can create custom evaluations to assess subjective or objective criteria - like tone, helpfulness, or factuality - and run them at scale across your traces and spans.
Contributor: spaces around dashes criteria - like tone, helpfulness, or factuality - and

Contributor Author: done

### Managed Evaluations

Datadog builds and supports [Managed Evaluations][1] to support common use cases. You can enable and configure them within the LLM Observability application.
Datadog builds and supports [Managed Evaluations][2] to support common use cases. You can enable and configure them within the LLM Observability application.

### Submit External Evaluations

You can also submit [External Evaluations][2] using Datadog's API. This mechanism is great if you have your own evaluation system, but would like to centralize that information within Datadog.
You can also submit [External Evaluations][3] using Datadog's API. This mechanism is great if you have your own evaluation system, but would like to centralize that information within Datadog.

### Evaluation Integrations

Datadog also supports integrations with some 3rd party evaluation frameworks, such as [Ragas][3] and [NeMo][4].
Datadog also supports integrations with some 3rd party evaluation frameworks, such as [Ragas][4] and [NeMo][5].

### Sensitive Data Scanner integration

In addition to evaluating the input and output of LLM requests, agents, workflows, or the application, LLM Observability integrates with [Sensitive Data Scanner][5], which helps prevent data leakage by identifying and redacting any sensitive information (such as personal data, financial details, or proprietary information) that may be present in any step of your LLM application.
In addition to evaluating the input and output of LLM requests, agents, workflows, or the application, LLM Observability integrates with [Sensitive Data Scanner][6], which helps prevent data leakage by identifying and redacting any sensitive information (such as personal data, financial details, or proprietary information) that may be present in any step of your LLM application.

By proactively scanning for sensitive data, LLM Observability ensures that conversations remain secure and compliant with data protection regulations. This additional layer of security reinforces Datadog's commitment to maintaining the confidentiality and integrity of user interactions with LLMs.

### Permissions

[LLM Observability Write permissions][6] are necessary to configure evaluations.
[LLM Observability Write permissions][7] are necessary to configure evaluations.

[1]: /llm_observability/evaluations/managed_evaluations
[2]: /llm_observability/evaluations/external_evaluations
[3]: /llm_observability/evaluations/ragas_evaluations
[4]: /llm_observability/evaluations/submit_nemo_evaluations
[5]: /security/sensitive_data_scanner/
[6]: /account_management/rbac/permissions/#llm-observability
[7]: https://app.datadoghq.com/llm/settings/evaluations
[1]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations
[2]: /llm_observability/evaluations/managed_evaluations
[3]: /llm_observability/evaluations/external_evaluations
[4]: /llm_observability/evaluations/ragas_evaluations
[5]: /llm_observability/evaluations/submit_nemo_evaluations
[6]: /security/sensitive_data_scanner/
[7]: /account_management/rbac/permissions/#llm-observability
[8]: https://app.datadoghq.com/llm/settings/evaluations
@@ -0,0 +1,180 @@
---
title: Custom LLM-as-a-Judge Evaluations
description: Learn how to create Custom LLM-as-a-judge Evaluations.
further_reading:
- link: "/llm_observability/terms/"
tag: "Documentation"
text: "Learn about LLM Observability terms and concepts"
- link: "/llm_observability/setup"
tag: "Documentation"
text: "Learn how to set up LLM Observability"
- link: "/llm_observability/evaluations/managed_evaluations"
tag: "Documentation"
text: "Learn about Managed Evaluations"
- link: "https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/"
tag: "Blog"
text: "Building an LLM evaluation framework: best practices"
---

## Overview

Custom LLM-as-a-Judge Evaluations let you define your own evaluation logic to automatically assess your LLM applications. You can use natural language prompts to capture subjective or objective criteria - like tone, helpfulness, or factuality - and run them at scale across your traces and spans.
Contributor: criteria - like tone, helpfulness, or factuality - and

Contributor Author: done

This provides a flexible, automated way to monitor model quality, detect regressions, and track improvements over time.

## How it works

Custom LLM-as-a-Judge Evaluations use an LLM to judge the performance of another LLM.

You define:
- The criteria (via prompt text)
- What is evaluated (e.g., a span's output)
- The model (e.g., `GPT-4o`)
Contributor: nit: GPT-4o - format as code using backticks

Contributor Author: done
- The output type (`boolean`, numeric `score`, or `categorical` label)
Contributor: same format boolean, score and categorical as code since they refer to code concepts in our app

Contributor Author: done

Datadog then runs this evaluation logic automatically against your spans, recording results for you to query, visualize, and monitor.
Contributor: maybe structured evaluation metrics. "metrics" means this metric at datadog, we brand these things as evaluations on the UI or in the query language

Contributor Author: Changed the wording to just "...recording results for you to query, visualize, and monitor."

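Conceptually, the pattern works like the following sketch: the span's input and output are interpolated into your judge prompt, sent to the judge model, and the response is parsed into the output type you chose. This is an illustration only, not the Datadog implementation; `call_llm` and `judge_helpfulness` are hypothetical helpers standing in for whichever LLM client sits behind the judge.

{{< code-block lang="python" >}}
from typing import Callable

# Illustrative only: not part of the Datadog SDK or the managed evaluation
# runtime. It shows the general LLM-as-a-judge pattern.

JUDGE_SYSTEM_PROMPT = (
    "You are an evaluator. Decide whether the assistant's answer is "
    "helpful. Respond with exactly 'true' or 'false'."
)

def judge_helpfulness(
    span_input: str,
    span_output: str,
    call_llm: Callable[[str, str], str],  # hypothetical LLM client wrapper
) -> bool:
    """Build the judge prompt for one span and parse a boolean verdict."""
    user_prompt = f"Span Input: {span_input}\nSpan Output: {span_output}"
    verdict = call_llm(JUDGE_SYSTEM_PROMPT, user_prompt)
    return verdict.strip().lower() == "true"
{{< /code-block >}}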
## Create a custom evaluation

You can create and manage custom evaluations from the [Evaluations page][1] in LLM Observability.

{{< img src="llm_observability/evaluations/custom_llm_judge_1.png" alt="Begin creating your own Custom LLM-as-a-judge Evaluation by opening the Create Evaluation side panel from the Evaluations page" style="width:100%;" >}}


### 1. Name your evaluation

Give your evaluation a clear, descriptive name (e.g., `factuality-check` or `tone-eval`). You will use this name later when querying evaluation results. The name must be unique within your application.

### 2. Choose an LLM provider and model

Select your LLM account and the model you wish to use for the evaluation. If you do not have an LLM account already integrated with LLM Observability, follow these instructions to [connect an LLM provider][2].

### 3. Define the evaluation prompt

In the **Evaluation Prompt** section, you can either:
- Use preexisting prompt templates, including:
- Failure to Answer
- Prompt Injection
- Sentiment
- Topic Relevancy
- Toxicity
- Create an evaluation from scratch by writing your own criteria.

Templates can be used as-is or modified to match your specific evaluation logic.

#### Writing a custom prompt

In the **System Prompt** field, write clear instructions describing what the evaluator should assess.

- Focus on a single evaluation goal
- Include 2–3 few-shot examples showing input/output pairs, expected results, and reasoning.

In the **User Prompt** field, explicitly specify what parts of the span to evaluate: Span Input (`{{span_input}}`), Output (`{{span_output}}`), or both.

**Example System Prompt:**

{{< code-block lang="text" >}}
You will be looking at interactions between a user and a budgeting AI agent. Your job is to classify the user's intent when it comes to using the budgeting AI agent.

You will be given a Span Input, which represents the user's message to the agent, which you will then classify. Here are some examples.

Span Input: What are the core things I should know about budgeting?
Classification: general_financial_advice

Span Input: Did I go over budget with my grocery bills last month?
Classification: budgeting_question

Span Input: What is the category for which I have the highest budget?
Classification: budgeting_question

Span Input: Based on my past months, what is my ideal budget for subscriptions?
Classification: budgeting_advice

Span Input: Raise my restaurant budget by $50
Classification: budgeting_request

Span Input: Help me plan a trip to the Maldives
Classification: unrelated
{{< /code-block >}}

**Example User Message:**

{{< code-block lang="text" >}}
Span Input: {{span_input}}
{{< /code-block >}}

### 4. Choose an output type

Define the expected output schema for the evaluator:

- **Boolean** – True/False results (e.g., "Did the model follow instructions?")
- **Score** – Numeric rating (e.g., 1–5 scale for helpfulness)
- **Categorical** – Discrete labels (e.g., "Good", "Bad", "Neutral")

The schema ensures your results are structured for querying and dashboarding. For Anthropic and Bedrock models, only Boolean output types are allowed.
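As a rough illustration, a `categorical` output type can be thought of as constraining the judge's response to a fixed set of labels through a JSON-schema-style structured output, as in the sketch below. This is an assumption for illustration only: the exact schema generated for your evaluation may differ, and the names used here are placeholders.

{{< code-block lang="python" >}}
# Illustrative only: one way a categorical output type could be expressed
# as a JSON-schema-style structured output.
tone_eval_output_schema = {
    "type": "object",
    "properties": {
        "label": {"type": "string", "enum": ["Good", "Bad", "Neutral"]},
    },
    "required": ["label"],
    "additionalProperties": False,
}
{{< /code-block >}}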
Contributor: we should mention
1. the fact that structured output is openAIs structured output and it needs to be edited by the user
2. the keyword search logic for anthropic

Contributor Author: Added


You can preview and refine your logic in the [**Test Evaluation**](#6-test-your-evaluation) panel by providing sample span input/output and clicking **Run Evaluation** to verify outputs.

### 5. Configure filters and sampling

Choose which application and spans to evaluate:

- **Traces** – Evaluate only root spans
- **All Spans** – Include all spans
- **Span Names** – Target spans by name
- **Tags** – Limit evaluation to spans with certain tags

Optionally, apply sampling (for example, 10%) to control evaluation cost.

### 6. Test your evaluation

Use the **Test Evaluation** panel on the right to preview how your evaluator performs.

You can input sample `{{span_input}}` and `{{span_output}}` values, then click **Run Evaluation** to see the LLM-as-a-Judge's output before saving. Modify your evaluation until you are satisfied with the results.

{{< img src="llm_observability/evaluations/custom_llm_judge_2.png" alt="The Test Evaluation panel allows you to preview your evaluation before saving." style="width:100%;" >}}


## Viewing and using results

After you save an evaluation, it automatically runs on the targeted spans, and results become available across LLM Observability in near real time. Custom LLM-as-a-Judge results for a specific span appear in the **Evaluations** tab alongside all other evaluations.

{{< img src="llm_observability/evaluations/custom_llm_judge_3.png" alt="View custom evaluation results alongside managed evaluations in the Evaluations tab of a trace" style="width:100%;" >}}

Use the syntax `@evaluations.custom.<evaluation_name>` to query or visualize results.

For example:
```
@evaluations.custom.helpfulness-check
```

{{< img src="llm_observability/evaluations/custom_llm_judge_4.png" alt="Filter and query traces using custom evaluation results in the LLM Observability Traces page" style="width:100%;" >}}


You can:
- Filter traces by evaluation results
- Use evaluation results as [facets][5]
- View aggregate results in the LLM Observability Overview page's Evaluation section
- Create [monitors][6] to alert on performance changes or regressions
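
For example, assuming a `categorical` evaluation named `tone-eval` that can return a `Negative` label (both names are placeholders for your own setup), a query along these lines could back a trace filter, a facet, or a monitor:
```
@evaluations.custom.tone-eval:Negative
```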

## Best practices for reliable custom evaluations

- **Start small**: Target a single, well-defined failure mode before scaling.
- **Iterate**: Run, inspect outputs, and refine your prompt.
- **Validate**: Periodically check evaluator accuracy using sampled traces.
- **Document your rubric**: Clearly define what "Pass" and "Fail" mean to avoid drift over time.
- **Re-align your evaluator**: Reassess prompt and few-shot examples when the underlying LLM updates.

For more resources on best practices, see [Building an LLM evaluation framework: best practices][3] and [Using LLM-as-a-judge for automated and versatile evaluation][4].

## Further Reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: https://app.datadoghq.com/llm/settings/evaluations
[2]: /llm_observability/evaluations/managed_evaluations#connect-your-llm-provider-account
[3]: https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/
[4]: https://huggingface.co/learn/cookbook/llm_judge
[5]: https://docs.datadoghq.com/service_management/events/explorer/facets/
[6]: https://docs.datadoghq.com/monitors/
