Documentation for Custom LLM-as-a-judge Evaluations #32018
@@ -4770,6 +4770,11 @@ menu:
       parent: llm_obs
       identifier: llm_obs_evaluations
       weight: 4
+    - name: Custom LLM-as-a-Judge
+      url: llm_observability/evaluations/custom_llm_as_a_judge_evaluations
+      parent: llm_obs_evaluations
+      identifier: llm_obs_custom_llm_as_a_judge_evaluations
+      weight: 400
     - name: Managed
       url: llm_observability/evaluations/managed_evaluations
       parent: llm_obs_evaluations
@@ -8,34 +8,39 @@ aliases:

 ## Overview

-LLM Observability offers several ways to support evaluations. They can be configured by navigating to [**AI Observability > Settings > Evaluations**][7].
+LLM Observability offers several ways to support evaluations. They can be configured by navigating to [**AI Observability > Settings > Evaluations**][8].

+### Custom LLM-as-a-Judge Evaluations
+
+[Custom LLM-as-a-Judge Evaluations][1] allow you to define your own evaluation logic using natural language prompts. You can create custom evaluations to assess subjective or objective criteria—like tone, helpfulness, or factuality—and run them at scale across your traces and spans.
+
 ### Managed Evaluations

-Datadog builds and supports [Managed Evaluations][1] to support common use cases. You can enable and configure them within the LLM Observability application.
+Datadog builds and supports [Managed Evaluations][2] to support common use cases. You can enable and configure them within the LLM Observability application.

 ### Submit External Evaluations

-You can also submit [External Evaluations][2] using Datadog's API. This mechanism is great if you have your own evaluation system, but would like to centralize that information within Datadog.
+You can also submit [External Evaluations][3] using Datadog's API. This mechanism is great if you have your own evaluation system, but would like to centralize that information within Datadog.

 ### Evaluation Integrations

-Datadog also supports integrations with some 3rd party evaluation frameworks, such as [Ragas][3] and [NeMo][4].
+Datadog also supports integrations with some 3rd party evaluation frameworks, such as [Ragas][4] and [NeMo][5].

 ### Sensitive Data Scanner integration

-In addition to evaluating the input and output of LLM requests, agents, workflows, or the application, LLM Observability integrates with [Sensitive Data Scanner][5], which helps prevent data leakage by identifying and redacting any sensitive information (such as personal data, financial details, or proprietary information) that may be present in any step of your LLM application.
+In addition to evaluating the input and output of LLM requests, agents, workflows, or the application, LLM Observability integrates with [Sensitive Data Scanner][6], which helps prevent data leakage by identifying and redacting any sensitive information (such as personal data, financial details, or proprietary information) that may be present in any step of your LLM application.

 By proactively scanning for sensitive data, LLM Observability ensures that conversations remain secure and compliant with data protection regulations. This additional layer of security reinforces Datadog's commitment to maintaining the confidentiality and integration of user interactions with LLMs.

 ### Permissions

-[LLM Observability Write permissions][6] are necessary to configure evaluations.
+[LLM Observability Write permissions][7] are necessary to configure evaluations.

-[1]: /llm_observability/evaluations/managed_evaluations
-[2]: /llm_observability/evaluations/external_evaluations
-[3]: /llm_observability/evaluations/ragas_evaluations
-[4]: /llm_observability/evaluations/submit_nemo_evaluations
-[5]: /security/sensitive_data_scanner/
-[6]: /account_management/rbac/permissions/#llm-observability
-[7]: https://app.datadoghq.com/llm/settings/evaluations
+[1]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations
+[2]: /llm_observability/evaluations/managed_evaluations
+[3]: /llm_observability/evaluations/external_evaluations
+[4]: /llm_observability/evaluations/ragas_evaluations
+[5]: /llm_observability/evaluations/submit_nemo_evaluations
+[6]: /security/sensitive_data_scanner/
+[7]: /account_management/rbac/permissions/#llm-observability
+[8]: https://app.datadoghq.com/llm/settings/evaluations
@@ -0,0 +1,180 @@
---
title: Custom LLM-as-a-Judge Evaluations
description: Learn how to create Custom LLM-as-a-Judge Evaluations.
further_reading:
- link: "/llm_observability/terms/"
  tag: "Documentation"
  text: "Learn about LLM Observability terms and concepts"
- link: "/llm_observability/setup"
  tag: "Documentation"
  text: "Learn how to set up LLM Observability"
- link: "/llm_observability/evaluations/managed_evaluations"
  tag: "Documentation"
  text: "Learn about Managed Evaluations"
- link: "https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/"
  tag: "Blog"
  text: "Building an LLM evaluation framework: best practices"
---

## Overview

Custom LLM-as-a-Judge Evaluations let you define your own evaluation logic to automatically assess your LLM applications. You can use natural language prompts to capture subjective or objective criteria—like tone, helpfulness, or factuality—and run them at scale across your traces and spans.

This provides a flexible, automated way to monitor model quality, detect regressions, and track improvements over time.

## How it works

Custom LLM-as-a-Judge Evaluations use an LLM to judge the performance of another LLM.

You define:
- The criteria (via prompt text)
- What is evaluated (e.g., a span's output)
- The model (e.g., GPT-4o)
- The output type (boolean, numeric score, or categorical label)

Datadog then runs this evaluation logic automatically against your spans, recording results as structured metrics that you can query, visualize, and monitor.
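
For example, a hypothetical evaluation (all names and values below are illustrative, not a literal configuration format) might combine these pieces as follows:

```
Name:         factuality-check
Model:        GPT-4o, through a connected OpenAI account
Evaluates:    {{span_output}} of LLM spans in a "chatbot" application
Output type:  Boolean (true when the output is factually consistent with the input)
```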

## Create a custom evaluation

You can create and manage custom evaluations from the [Evaluations page][1] in LLM Observability.

{{< img src="llm_observability/evaluations/custom_llm_judge_1.png" alt="Begin creating your own Custom LLM-as-a-Judge Evaluation by opening the Create Evaluation side panel from the Evaluations page" style="width:100%;" >}}

### 1. Name your evaluation

Give your evaluation a clear, descriptive name (e.g., `factuality-check` or `tone-eval`). You will use this name later when querying evaluation results. The name must be unique within your application.
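
For example, if you name the evaluation `factuality-check` (a hypothetical name), its results can later be queried with:

```
@evaluations.custom.factuality-check
```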

### 2. Choose an LLM provider and model

Select your LLM account and the model you wish to use for the evaluation. If you do not have an LLM account already integrated with LLM Observability, follow these instructions to [connect an LLM provider][2].

### 3. Define the evaluation prompt

In the **Evaluation Prompt** section, you can either:
- Use preexisting prompt templates, including:
  - Failure to Answer
  - Prompt Injection
  - Sentiment
  - Topic Relevancy
  - Toxicity
- Create an evaluation from scratch by writing your own criteria.

Templates can be used as-is or modified to match your specific evaluation logic.

#### Writing a custom prompt

In the **System Prompt** field, write clear instructions describing what the evaluator should assess.

- Focus on a single evaluation goal.
- Include 2–3 few-shot examples showing input/output pairs, expected results, and reasoning.

In the **User Prompt** field, explicitly specify what parts of the span to evaluate: Span Input (`{{span_input}}`), Output (`{{span_output}}`), or both.

**Example System Prompt:**

{{< code-block lang="text" >}}
You will be looking at interactions between a user and a budgeting AI agent. Your job is to classify the user's intent when it comes to using the budgeting AI agent.

You will be given a Span Input, which represents the user's message to the agent, which you will then classify. Here are some examples.

Span Input: What are the core things I should know about budgeting?
Classification: general_financial_advice

Span Input: Did I go over budget with my grocery bills last month?
Classification: budgeting_question

Span Input: What is the category for which I have the highest budget?
Classification: budgeting_question

Span Input: Based on my past months, what is my ideal budget for subscriptions?
Classification: budgeting_advice

Span Input: Raise my restaurant budget by $50
Classification: budgeting_request

Span Input: Help me plan a trip to the Maldives
Classification: unrelated
{{< /code-block >}}

**Example User Message:**

{{< code-block lang="text" >}}
Span Input: {{span_input}}
{{< /code-block >}}

### 4. Choose an output type

Define the expected output schema for the evaluator:

- **Boolean** – True/False results (e.g., "Did the model follow instructions?")
- **Score** – Numeric rating (e.g., 1–5 scale for helpfulness)
- **Categorical** – Discrete labels (e.g., "Good", "Bad", "Neutral")

The schema ensures your results are structured for querying and dashboarding. For Anthropic and Bedrock models, only Boolean output types are allowed.
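
For example, a Boolean evaluation can follow the same pattern as the classification prompt above. The following is a minimal sketch for a hypothetical instruction-following check, with the judge instructed to answer only true or false:

{{< code-block lang="text" >}}
You will be looking at interactions between a user and an AI assistant. Your job is to decide whether the assistant followed the user's instructions.

You will be given a Span Input (the user's request) and a Span Output (the assistant's response). Answer true if the response follows the instructions, and false otherwise. Here are some examples.

Span Input: Summarize this article in exactly three bullet points.
Span Output: Here is a detailed two-paragraph summary of the article...
Result: false

Span Input: Translate "good morning" into French.
Span Output: Bonjour
Result: true
{{< /code-block >}}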

You can preview and refine your logic in the [**Test Evaluation**](#6-test-your-evaluation) panel by providing sample span input/output and clicking **Run Evaluation** to verify outputs.

### 5. Configure filters and sampling

Choose which application and spans to evaluate:

- **Traces** – Evaluate only root spans
- **All Spans** – Include all spans
- **Span Names** – Target spans by name
- **Tags** – Limit evaluation to spans with certain tags

Optionally, apply sampling (for example, 10%) to control evaluation cost.
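
For instance, a hypothetical scope (illustrative values only, not a literal settings format) might look like:

```
Application: chatbot-prod
Spans:       Span Names = generate_response
Tags:        env:production
Sampling:    10% of matching spans
```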

### 6. Test your evaluation

Use the **Test Evaluation** panel on the right to preview how your evaluator performs.

You can input sample `{{span_input}}` and `{{span_output}}` values, then click **Run Evaluation** to see the LLM-as-a-Judge's output before saving. Modify your evaluation until you are satisfied with the results.

{{< img src="llm_observability/evaluations/custom_llm_judge_2.png" alt="The Test Evaluation panel allows you to preview your evaluation before saving." style="width:100%;" >}}

## Viewing and using results

After you save an evaluation, it automatically runs on the targeted spans, and results become available across LLM Observability in near real time. Custom LLM-as-a-Judge results for a specific span appear in the **Evaluations** tab, alongside all other evaluations.

{{< img src="llm_observability/evaluations/custom_llm_judge_3.png" alt="View custom evaluation results alongside managed evaluations in the Evaluations tab of a trace" style="width:100%;" >}}

Use the syntax `@evaluations.custom.<evaluation_name>` to query or visualize results.

For example:
```
@evaluations.custom.helpfulness-check
```
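
Depending on the output type, you can also filter on the returned value. The following queries use hypothetical evaluation names with standard Datadog search syntax:

```
@evaluations.custom.factuality-check:false
@evaluations.custom.helpfulness-score:>=4
@evaluations.custom.intent-classifier:budgeting_question
```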

{{< img src="llm_observability/evaluations/custom_llm_judge_4.png" alt="Filter and query traces using custom evaluation results in the LLM Observability Traces page" style="width:100%;" >}}

You can:
- Filter traces by evaluation results
- Use evaluation results as [facets][5]
- View aggregate results in the Evaluation section of the LLM Observability Overview page
- Create [monitors][6] to alert on performance changes or regressions (see the example query below)
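
As a sketch of the monitor use case: a monitor could track the number of spans where a hypothetical prompt-injection evaluation returns true, and alert when that count exceeds a threshold you choose. The underlying filter uses the same evaluation syntax:

```
@evaluations.custom.prompt-injection-check:true
```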

## Best practices for reliable custom evaluations

- **Start small**: Target a single, well-defined failure mode before scaling.
- **Iterate**: Run, inspect outputs, and refine your prompt.
- **Validate**: Periodically check evaluator accuracy using sampled traces.
- **Document your rubric**: Clearly define what "Pass" and "Fail" mean to avoid drift over time.
- **Re-align your evaluator**: Reassess your prompt and few-shot examples when the underlying LLM is updated.

For more on best practices, see [Building an LLM evaluation framework: best practices][3] and [Using LLM-as-a-judge for automated and versatile evaluation][4].

## Further reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: https://app.datadoghq.com/llm/settings/evaluations
[2]: /llm_observability/evaluations/managed_evaluations#connect-your-llm-provider-account
[3]: https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/
[4]: https://huggingface.co/learn/cookbook/llm_judge
[5]: /service_management/events/explorer/facets/
[6]: /monitors/
you don't need to change these, just the english one
done