
Commit 818d2fe

Documentation for Custom LLM-as-a-judge Evaluations (#32018)
* initial commit
* improve
* undo changes to managed_evaluations file
* add to side panel menu
* be more consistent with capitalization
* remove from other languages
* greg comments
* updates
* missed capitalization
* last few edits

---------

Co-authored-by: cecilia saixue watt <[email protected]>
1 parent 7e5bc6e commit 818d2fe

File tree

9 files changed: +254, -17 lines

config/_default/menus/main.en.yaml

Lines changed: 6 additions & 1 deletion
```diff
@@ -344,7 +344,7 @@ menu:
       url: agent/supported_platforms/heroku/
       weight: 405
       parent: agent_supported_platforms
-    - name: MacOS
+    - name: MacOS
       identifier: basic_agent_usage_osx
       url: agent/supported_platforms/osx/
       weight: 406
@@ -4780,6 +4780,11 @@ menu:
       parent: llm_obs
       identifier: llm_obs_evaluations
       weight: 4
+    - name: Custom LLM-as-a-Judge
+      url: llm_observability/evaluations/custom_llm_as_a_judge_evaluations
+      parent: llm_obs_evaluations
+      identifier: llm_obs_custom_llm_as_a_judge_evaluations
+      weight: 400
     - name: Managed
       url: llm_observability/evaluations/managed_evaluations
       parent: llm_obs_evaluations
```

content/en/llm_observability/evaluations/_index.md

Lines changed: 21 additions & 16 deletions
```diff
@@ -8,34 +8,39 @@ aliases:
 
 ## Overview
 
-LLM Observability offers several ways to support evaluations. They can be configured by navigating to [**AI Observability > Settings > Evaluations**][7].
+LLM Observability offers several ways to support evaluations. They can be configured by navigating to [**AI Observability > Settings > Evaluations**][8].
 
-### Managed Evaluations
+### Custom LLM-as-a-judge evaluations
 
-Datadog builds and supports [Managed Evaluations][1] to support common use cases. You can enable and configure them within the LLM Observability application.
+[Custom LLM-as-a-judge evaluations][1] allow you to define your own evaluation logic using natural language prompts. You can create custom evaluations to assess subjective or objective criteria (like tone, helpfulness, or factuality) and run them at scale across your traces and spans.
 
-### Submit External Evaluations
+### Managed evaluations
 
-You can also submit [External Evaluations][2] using Datadog's API. This mechanism is great if you have your own evaluation system, but would like to centralize that information within Datadog.
+Datadog builds and supports [managed evaluations][2] to support common use cases. You can enable and configure them within the LLM Observability application.
 
-### Evaluation Integrations
+### Submit external evaluations
 
-Datadog also supports integrations with some 3rd party evaluation frameworks, such as [Ragas][3] and [NeMo][4].
+You can also submit [external evaluations][3] using Datadog's API. This mechanism is great if you have your own evaluation system, but would like to centralize that information within Datadog.
+
+### Evaluation integrations
+
+Datadog also supports integrations with some 3rd party evaluation frameworks, such as [Ragas][4] and [NeMo][5].
 
 ### Sensitive Data Scanner integration
 
-In addition to evaluating the input and output of LLM requests, agents, workflows, or the application, LLM Observability integrates with [Sensitive Data Scanner][5], which helps prevent data leakage by identifying and redacting any sensitive information (such as personal data, financial details, or proprietary information) that may be present in any step of your LLM application.
+In addition to evaluating the input and output of LLM requests, agents, workflows, or the application, LLM Observability integrates with [Sensitive Data Scanner][6], which helps prevent data leakage by identifying and redacting any sensitive information (such as personal data, financial details, or proprietary information) that may be present in any step of your LLM application.
 
 By proactively scanning for sensitive data, LLM Observability ensures that conversations remain secure and compliant with data protection regulations. This additional layer of security reinforces Datadog's commitment to maintaining the confidentiality and integration of user interactions with LLMs.
 
 ### Permissions
 
-[LLM Observability Write permissions][6] are necessary to configure evaluations.
+[`LLM Observability Write` permissions][7] are necessary to configure evaluations.
 
-[1]: /llm_observability/evaluations/managed_evaluations
-[2]: /llm_observability/evaluations/external_evaluations
-[3]: /llm_observability/evaluations/ragas_evaluations
-[4]: /llm_observability/evaluations/submit_nemo_evaluations
-[5]: /security/sensitive_data_scanner/
-[6]: /account_management/rbac/permissions/#llm-observability
-[7]: https://app.datadoghq.com/llm/settings/evaluations
+[1]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations
+[2]: /llm_observability/evaluations/managed_evaluations
+[3]: /llm_observability/evaluations/external_evaluations
+[4]: /llm_observability/evaluations/ragas_evaluations
+[5]: /llm_observability/evaluations/submit_nemo_evaluations
+[6]: /security/sensitive_data_scanner/
+[7]: /account_management/rbac/permissions/#llm-observability
+[8]: https://app.datadoghq.com/llm/settings/evaluations
```
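As a quick illustration of the external-evaluations path mentioned in the diff above, the following is a minimal sketch that assumes the ddtrace Python SDK's `LLMObs.submit_evaluation()` helper; the application name, model details, evaluation label, and value are all hypothetical, and the HTTP evaluations API is an alternative to the SDK.

```python
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import llm

LLMObs.enable(ml_app="my-ml-app")  # hypothetical application name

@llm(model_name="gpt-4o", model_provider="openai")  # hypothetical model details
def answer(question: str) -> str:
    answer_text = "Paris is the capital of France."  # stand-in for a real model call

    # Record the span's input and output for LLM Observability.
    LLMObs.annotate(input_data=question, output_data=answer_text)

    # Attach an evaluation computed by your own system to the currently active span.
    LLMObs.submit_evaluation(
        span_context=LLMObs.export_span(span=None),  # export the active span's trace/span IDs
        label="helpfulness",   # hypothetical evaluation label
        metric_type="score",   # "score" or "categorical"
        value=0.8,             # result produced by your own evaluation system
    )
    return answer_text

answer("What is the capital of France?")
```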
content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations.md

Lines changed: 161 additions & 0 deletions (new file)
---
title: Custom LLM-as-a-Judge Evaluations
description: How to create custom LLM-as-a-judge evaluations, and how to use these evaluation results across LLM Observability.
further_reading:
- link: "/llm_observability/terms/"
  tag: "Documentation"
  text: "Learn about LLM Observability terms and concepts"
- link: "/llm_observability/setup"
  tag: "Documentation"
  text: "Learn how to set up LLM Observability"
- link: "/llm_observability/evaluations/managed_evaluations"
  tag: "Documentation"
  text: "Learn about managed evaluations"
- link: "https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/"
  tag: "Blog"
  text: "Building an LLM evaluation framework: best practices"
- link: "https://huggingface.co/learn/cookbook/llm_judge"
  tag: "Hugging Face"
  text: "Using LLM-as-a-judge for an automated and versatile evaluation"
---

Custom LLM-as-a-judge evaluations use an LLM to judge the performance of another LLM. You can define evaluation logic with natural language prompts, capture subjective or objective criteria (like tone, helpfulness, or factuality), and run these evaluations at scale across your traces and spans.

## Create a custom LLM-as-a-judge evaluation

You can create and manage custom evaluations from the [Evaluations page][1] in LLM Observability.

1. In Datadog, navigate to the LLM Observability [Evaluations page][1]. Select **Create Evaluation**, then select **Create your own**.
   {{< img src="llm_observability/evaluations/custom_llm_judge_1.png" alt="The LLM Observability Evaluations page with the Create Evaluation side panel opened. The first item, 'Create your own,' is selected." style="width:100%;" >}}

1. Provide a clear, descriptive **evaluation name** (for example, `factuality-check` or `tone-eval`). You use this name when querying evaluation results. The name must be unique within your application.

1. Use the **Account** drop-down menu to select the LLM provider and corresponding account to use for your LLM judge. To connect a new account, see [connect an LLM provider][2].

1. Use the **Model** drop-down menu to select a model to use for your LLM judge.

1. Under the **Evaluation Prompt** section, use the **Prompt Template** drop-down menu:
   - **Create from scratch**: Use your own custom prompt (defined in the next step).
   - **Failure to Answer**, **Prompt Injection**, **Sentiment**, and so on: Populate a pre-existing prompt template. You can use these templates as-is, or modify them to match your specific evaluation logic.

1. In the **System Prompt** field, enter your custom prompt or modify a prompt template. For custom prompts, provide clear instructions describing what the evaluator should assess.

   - Focus on a single evaluation goal.
   - Include 2–3 few-shot examples showing input/output pairs, expected results, and reasoning.

{{% collapse-content title="Example custom prompt" level="h4" expanded=false id="custom-prompt-example" %}}
**System Prompt**
```
You will be looking at interactions between a user and a budgeting AI agent. Your job is to classify the user's intent when it comes to using the budgeting AI agent.

You will be given a Span Input, which represents the user's message to the agent, which you will then classify. Here are some examples.

Span Input: What are the core things I should know about budgeting?
Classification: general_financial_advice

Span Input: Did I go over budget with my grocery bills last month?
Classification: budgeting_question

Span Input: What is the category for which I have the highest budget?
Classification: budgeting_question

Span Input: Based on my past months, what is my ideal budget for subscriptions?
Classification: budgeting_advice

Span Input: Raise my restaurant budget by $50
Classification: budgeting_request

Span Input: Help me plan a trip to the Maldives
Classification: unrelated
```

**User**

```
Span Input: {{span_input}}
```
{{% /collapse-content %}}

7. In the **User** field, provide your user prompt. Explicitly specify which parts of the span to evaluate: Span Input (`{{span_input}}`), Output (`{{span_output}}`), or both.

8. Select an evaluation output type:

   - **Boolean**: True/false results (for example, "Did the model follow instructions?")
   - **Score**: Numeric ratings (for example, a 1–5 scale for helpfulness)
   - **Categorical**: Discrete labels (for example, "Good", "Bad", "Neutral")

   <div class="alert alert-info">For Anthropic and Amazon Bedrock models, only the <strong>Boolean</strong> output type is available.</div>

9. Define the structure of your output.

{{< tabs >}}
{{% tab "OpenAI" %}}
{{% llm-eval-output-json %}}
{{% /tab %}}

{{% tab "Azure OpenAI" %}}
{{% llm-eval-output-json %}}
{{% /tab %}}

{{% tab "Anthropic" %}}
{{% llm-eval-output-keyword %}}
{{% /tab %}}

{{% tab "Amazon Bedrock" %}}
{{% llm-eval-output-keyword %}}
{{% /tab %}}
{{< /tabs >}}

10. Under **Evaluation Scope**, define the scope of your evaluation:
    - **Application**: Select the application you want to evaluate.
    - **Evaluate On**: Choose one of the following:
      - **Traces**: Evaluate only root spans.
      - **All Spans**: Evaluate both root and child spans.
    - **Span Names**: (Optional) Limit evaluation to spans with certain names.
    - **Tags**: (Optional) Limit evaluation to spans with certain tags.
    - **Sampling Rate**: (Optional) Apply sampling (for example, 10%) to control evaluation cost.

11. Use the **Test Evaluation** panel on the right to preview how your evaluator performs. You can enter sample `{{span_input}}` and `{{span_output}}` values, then click **Run Evaluation** to see the LLM-as-a-judge's output before saving. Modify your evaluation until you are satisfied with the results.

    {{< img src="llm_observability/evaluations/custom_llm_judge_2.png" alt="Creation flow for a custom LLM-as-a-judge evaluation. On the right, under Test Evaluation, sample span_input and span_output have been provided. An Evaluation Result textbox below displays a sample result." style="width:100%;" >}}
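For example, to spot-check the budgeting intent classifier defined in the example prompt above, you could paste a hypothetical user message as the test `{{span_input}}`:

```
Did I stay under my dining budget in March?
```

Given the few-shot examples in that prompt, the evaluator should return the `budgeting_question` category for this input.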

## View and use results

After you save your evaluation, Datadog automatically runs it on the targeted spans. Results are available across LLM Observability in near real time. You can find your custom LLM-as-a-judge results for a specific span in the **Evaluations** tab, next to all other evaluations.

{{< img src="llm_observability/evaluations/custom_llm_judge_3.png" alt="The Evaluations tab of a trace, displaying custom evaluation results alongside managed evaluations." style="width:100%;" >}}

Use the syntax `@evaluations.custom.<evaluation_name>` to query or visualize results.

For example:
```
@evaluations.custom.helpfulness-check
```

{{< img src="llm_observability/evaluations/custom_llm_judge_4.png" alt="The LLM Observability Traces view. In the search box, the user has entered `@evaluations.custom.budget-guru-intent-classifier:budgeting_question` and results are populated below." style="width:100%;" >}}

You can:
- Filter traces by evaluation results (see the example query below)
- Use evaluation results as [facets][3]
- View aggregate results in the Evaluations section of the LLM Observability Overview page
- Create [monitors][4] to alert on performance changes or regressions
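For example, the Traces view in the screenshot above filters to spans where the hypothetical `budget-guru-intent-classifier` evaluation returned a specific category:

```
@evaluations.custom.budget-guru-intent-classifier:budgeting_question
```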

## Best practices for reliable custom evaluations

- **Start small**: Target a single, well-defined failure mode before scaling.
- **Iterate**: Run, inspect outputs, and refine your prompt.
- **Validate**: Periodically check evaluator accuracy using sampled traces, as in the sketch below.
- **Document your rubric**: Clearly define what "Pass" and "Fail" mean to avoid drift over time.
- **Re-align your evaluator**: Reassess your prompt and few-shot examples when the underlying LLM is updated.
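For the **Validate** step, one lightweight approach (sketched below with hypothetical span IDs and labels) is to hand-label a small sample of evaluated spans and measure how often the judge agrees:

```python
# Hypothetical hand-labeled sample versus the judge's stored results for the same spans.
human_labels = {"span-1": "budgeting_question", "span-2": "unrelated", "span-3": "budgeting_request"}
judge_labels = {"span-1": "budgeting_question", "span-2": "budgeting_advice", "span-3": "budgeting_request"}

# Percent agreement between the LLM judge and human reviewers.
matches = sum(judge_labels[span_id] == label for span_id, label in human_labels.items())
print(f"Judge/human agreement: {matches / len(human_labels):.0%}")  # 67%
```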

## Further reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: https://app.datadoghq.com/llm/settings/evaluations
[2]: /llm_observability/evaluations/managed_evaluations#connect-your-llm-provider-account
[3]: /service_management/events/explorer/facets/
[4]: /monitors/
Lines changed: 56 additions & 0 deletions (new file: the `llm-eval-output-json` shortcode used in the OpenAI and Azure OpenAI tabs above)
Edit a JSON schema that defines your evaluation's output type.

#### Boolean
- Edit the `description` field to further explain what true and false mean in your use case.

#### Score
- Set a `min` and `max` score for your evaluation.
- Edit the `description` field to further explain the scale of your evaluation.
10+
#### Categorical
11+
- Add or remove categories by editing the JSON schema
12+
- Edit category names
13+
- Edit the `description` field of categories to further explain what they mean in the context of your evaluation.
14+
15+
An example schema for a categorical evaluation:
16+
17+
```
18+
{
19+
"name": "categorical_eval",
20+
"schema": {
21+
"type": "object",
22+
"required": [
23+
"categorical_eval"
24+
],
25+
"properties": {
26+
"categorical_eval": {
27+
"type": "string",
28+
"anyOf": [
29+
{
30+
"const": "budgeting_question",
31+
"description": "The user is asking a question about their budget. The answer can be directly determined by looking at their budget and spending."
32+
},
33+
{
34+
"const": "budgeting_request",
35+
"description": "The user is asking to change something about their budget. This should involve an action that changes their budget."
36+
},
37+
{
38+
"const": "budgeting_advice",
39+
"description": "The user is asking for advice on their budget. This should not require a change to their budget, but it should require an analysis of their budget and spending."
40+
},
41+
{
42+
"const": "general_financial_advice",
43+
"description": "The user is asking for general financial advice which is not directly related to their specific budget. However, this can include advice about budgeting in general."
44+
},
45+
{
46+
"const": "unrelated",
47+
"description": "This is a catch-all category for things not related to budgeting or financial advice."
48+
}
49+
]
50+
}
51+
},
52+
"additionalProperties": false
53+
},
54+
"strict": true
55+
}
56+
```
Lines changed: 10 additions & 0 deletions (new file: the `llm-eval-output-keyword` shortcode used in the Anthropic and Amazon Bedrock tabs above)
Provide **True keywords** and **False keywords** that define when the evaluation result is true or false, respectively.

Datadog searches the LLM-as-a-judge's response text for your defined keywords and provides the appropriate result for the evaluation. For this reason, you should instruct the LLM to respond with your chosen keywords.

For example, if you set:

- **True keywords**: Yes, yes
- **False keywords**: No, no

Then your system prompt should include an instruction like `Respond with "yes" or "no"`.
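Conceptually, the keyword mapping behaves like the following sketch. This is an illustration of the behavior described above, not Datadog's implementation; the exact matching rules are an assumption.

```python
# Illustrative sketch only: map an LLM judge's free-text response to a boolean result
# by searching for the configured keywords (assumed substring matching).
def keywords_to_boolean(judge_response, true_keywords, false_keywords):
    if any(keyword in judge_response for keyword in true_keywords):
        return True
    if any(keyword in judge_response for keyword in false_keywords):
        return False
    return None  # No keyword found, so no result can be recorded.


print(keywords_to_boolean("Yes, the response answers the question.", ["Yes", "yes"], ["No", "no"]))  # True
print(keywords_to_boolean("No.", ["Yes", "yes"], ["No", "no"]))  # False
```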
The remaining four changed files are the new screenshot images referenced above (custom_llm_judge_1.png through custom_llm_judge_4.png): 497 KB, 610 KB, 540 KB, and 972 KB.

0 commit comments
