Commit 5036d05 (initial commit, 1 parent: df67cc9)

2 files changed: +203 -34 lines changed

Lines changed: 169 additions & 0 deletions

---
title: Custom LLM-as-a-Judge Evaluations
description: Learn how to create custom evaluations for your LLM applications using LLM-as-a-Judge.
further_reading:
- link: "/llm_observability/terms/"
  tag: "Documentation"
  text: "Learn about LLM Observability terms and concepts"
- link: "/llm_observability/setup"
  tag: "Documentation"
  text: "Learn how to set up LLM Observability"
- link: "/llm_observability/evaluations/managed_evaluations"
  tag: "Documentation"
  text: "Learn about Managed Evaluations"
- link: "https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/"
  tag: "Blog"
  text: "Building an LLM evaluation framework: best practices"
---

## Overview

Custom LLM-as-a-Judge Evaluations let you define your own evaluation logic to automatically assess your LLM applications. You can use natural language prompts to capture subjective or objective criteria (like tone, helpfulness, or factuality) and run them at scale across your traces and spans.

This provides a flexible, automated way to monitor model quality, detect regressions, and track improvements over time.

## How it works

Custom LLM-as-a-Judge Evaluations use an LLM to judge the performance of another LLM.

You define:
- The criteria (via prompt text)
- What is evaluated (for example, a span's output)
- The model (for example, GPT-4o)
- The output type (Boolean, numeric score, or categorical label)

Datadog then runs this evaluation logic automatically against your spans, recording results as structured metrics that you can query, visualize, and monitor.

## Create a custom evaluation

You can create and manage custom evaluations from the [Evaluations page][1] in LLM Observability.

### 1. Name your evaluation

Give your evaluation a clear, descriptive name (for example, `factuality-check` or `tone-eval`).

You'll use this name later when querying evaluation results. The name must be unique within your application.

### 2. Choose a model provider and model

Select your LLM account and the model you wish to use for the evaluation. If you do not have an LLM account already integrated with LLM Observability, follow [these instructions][2] to connect an LLM provider.

### 3. Define the evaluation prompt

In the **Evaluation Prompt** section, you can either:
- Use preexisting prompt templates, including:
  - Failure to Answer
  - Prompt Injection
  - Sentiment
  - Topic Relevancy
  - Toxicity
- Or create an evaluation from scratch by writing your own criteria.

Templates can be used as-is or fully customized to match your specific evaluation logic.

#### Writing a custom prompt

In the **System Prompt** field, write clear instructions describing what the evaluator should assess.

- Focus on a single evaluation goal.
- Include 2–3 few-shot examples showing input/output pairs, expected results, and reasoning.

In the **User Prompt** field, explicitly specify what parts of the span to evaluate: Span Input, Output, or Both.

**Example System Prompt:**

{{< code-block lang="text" >}}
You will be looking at interactions between a user and a budgeting AI agent. Your job is to classify the user's intent when it comes to using the budgeting AI agent.

You will be given a Span Input, which represents the user's message to the agent. Classify this message. Here are some examples.

Span Input: What are the core things I should know about budgeting?
Classification: general_financial_advice

Span Input: Did I go over budget with my grocery bills last month?
Classification: budgeting_question

Span Input: What is the category for which I have the highest budget?
Classification: budgeting_question

Span Input: Based on my past months, what is my ideal budget for subscriptions?
Classification: budgeting_advice

Span Input: Raise my restaurant budget by $50
Classification: budgeting_request

Span Input: Help me plan a trip to the Maldives
Classification: unrelated
{{< /code-block >}}

**Example User Message:**

{{< code-block lang="text" >}}
Span Input: {{span_input}}
{{< /code-block >}}

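If your evaluation targets both the span's input and output, the user prompt can reference both template variables. A minimal sketch, assuming the `{{span_input}}` and `{{span_output}}` variables described in the **Test Evaluation** panel:

**Example User Message (Input and Output):**

{{< code-block lang="text" >}}
Span Input: {{span_input}}
Span Output: {{span_output}}
{{< /code-block >}}
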
### 4. Choose output type

Define the expected output schema for the evaluator:

- **Boolean** – True/False results (for example, "Did the model follow instructions?")
- **Score** – Numeric rating (for example, a 1–5 scale for helpfulness)
- **Categorical** – Discrete labels (for example, "Good", "Bad", "Neutral")

The schema ensures your results are structured for querying and dashboarding. For Anthropic and Bedrock models, only the Boolean output type is available.

You can preview and refine your logic in the **Test Evaluation** panel by providing sample span input/output and clicking **Run Evaluation** to verify outputs before saving.

### 5. Configure filters and sampling

Choose which application and spans to evaluate:

- **Traces** – Evaluate only root spans
- **All Spans** – Include all spans
- **Span Names** – Target spans by name
- **Tags** – Limit evaluation to spans with certain tags

Optionally, apply sampling (for example, 10%) to control evaluation cost.

### 6. Test your evaluation

Use the **Test Evaluation** panel on the right to preview how your evaluator performs.

You can input sample `{{span_input}}` and `{{span_output}}` values, then click **Run Evaluation** to see the LLM-as-a-Judge's output before saving. Modify your evaluation until you are satisfied with the results.

## Viewing and using results

After an evaluation is saved, it automatically runs on the targeted spans, and results are available across LLM Observability in near real time.

Use the syntax `@evaluations.custom.<evaluation_name>` to query or visualize results.

For example:
```
@evaluations.custom.helpfulness-check
```

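To narrow results to a specific outcome, you can pair the evaluation name with a value in the query. A minimal sketch, assuming a Boolean evaluation named `helpfulness-check` whose results are stored as `true`/`false`:

```
@evaluations.custom.helpfulness-check:false
```
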
You can:
- Filter traces by evaluation results
- Use evaluation results as facets
- View aggregate results in the LLM Observability Overview page's Evaluation section
- Create monitors to alert on performance changes or regressions

## Best practices for reliable custom evaluations

- **Start small**: Target a single, well-defined failure mode before scaling.
- **Iterate**: Run, inspect outputs, and refine your prompt.
- **Validate**: Periodically check evaluator accuracy using sampled traces.
- **Document your rubric**: Clearly define what "Pass" and "Fail" mean to avoid drift over time.
- **Re-align your evaluator**: Reassess your prompt and few-shot examples when the underlying LLM is updated.

For more resources on best practices, check out [Building an LLM evaluation framework: best practices][3] and [Using LLM-as-a-judge for automated and versatile evaluation][4].

## Further Reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: https://app.datadoghq.com/llm/settings/evaluations
[2]: /llm_observability/evaluations/managed_evaluations#connect-your-llm-provider-account
[3]: https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/
[4]: https://huggingface.co/learn/cookbook/llm_judge

content/en/llm_observability/evaluations/managed_evaluations.md

Lines changed: 34 additions & 34 deletions
@@ -134,8 +134,8 @@ Each of these metrics has `ml_app`, `model_server`, `model_provider`, `model_nam
#### Topic relevancy

This check identifies and flags user inputs that deviate from the configured acceptable input topics. This ensures that interactions stay pertinent to the LLM's designated purpose and scope.
-
-| Evaluation Stage | Evaluation Method | Evaluation Definition |
+
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
|---|---|---|
| Evaluated on Input | Evaluated using LLM | Topic relevancy assesses whether each prompt-response pair remains aligned with the intended subject matter of the Large Language Model (LLM) application. For instance, an e-commerce chatbot receiving a question about a pizza recipe would be flagged as irrelevant. |

@@ -156,7 +156,7 @@ This check identifies instances where the LLM makes a claim that disagrees with

{{< img src="llm_observability/evaluations/hallucination_1.png" alt="A Hallucination evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}

-| Evaluation Stage | Evaluation Method | Evaluation Definition |
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
|---|---|---|
| Evaluated on Output | Evaluated using LLM | Hallucination flags any output that disagrees with the context provided to the LLM. |

@@ -217,13 +217,13 @@ This check identifies instances where the LLM fails to deliver an appropriate re

{{< img src="llm_observability/evaluations/failure_to_answer_1.png" alt="A Failure to Answer evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}

-| Evaluation Stage | Evaluation Method | Evaluation Definition |
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
|---|---|---|
| Evaluated on Output | Evaluated using LLM | Failure To Answer flags whether each prompt-response pair demonstrates that the LLM application has provided a relevant and satisfactory answer to the user's question. |

##### Failure to Answer Configuration
<div class="alert alert-info">Configuring failure to answer evaluation categories is supported if OpenAI or Azure OpenAI is selected as your LLM provider.</div>
-You can configure the Failure to Answer evaluation to use specific categories of failure to answer, listed in the following table.
+You can configure the Failure to Answer evaluation to use specific categories of failure to answer, listed in the following table.

| Configuration Option | Description | Example(s) |
|---|---|---|
@@ -245,7 +245,7 @@ Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Belarusian, Bengali, Norwegi

{{< img src="llm_observability/evaluations/language_mismatch_1.png" alt="A Language Mismatch evaluation detected by an open source model in LLM Observability" style="width:100%;" >}}

-| Evaluation Stage | Evaluation Method | Evaluation Definition |
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
|---|---|---|
| Evaluated on Input and Output | Evaluated using Open Source Model | Language Mismatch flags whether each prompt-response pair demonstrates that the LLM application answered the user's question in the same language that the user used. |

@@ -255,7 +255,7 @@ This check helps understand the overall mood of the conversation, gauge user sat

{{< img src="llm_observability/evaluations/sentiment_1.png" alt="A Sentiment evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}

-| Evaluation Stage | Evaluation Method | Evaluation Definition |
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
|---|---|---|
| Evaluated on Input and Output | Evaluated using LLM | Sentiment flags the emotional tone or attitude expressed in the text, categorizing it as positive, negative, or neutral. |

@@ -265,7 +265,7 @@ This check evaluates whether your LLM chatbot can successfully carry out a full

{{< img src="llm_observability/evaluations/goal_completeness.png" alt="A Goal Completeness evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}

-| Evaluation Stage | Evaluation Method | Evaluation Definition |
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
|---|---|---|
| Evaluated on session | Evaluated using LLM | Goal Completeness assesses whether all user intentions within a multi-turn interaction were successfully resolved. The evaluation identifies resolved and unresolved intentions, providing a completeness score based on the ratio of unresolved to total intentions. |

@@ -314,7 +314,7 @@ This check evaluates whether the agent has successfully selected the appropriate

{{< img src="llm_observability/evaluations/tool_selection_failure.png" alt="A tool selection failure detected by the evaluation in LLM Observability" style="width:100%;" >}}

-| Evaluation Stage | Evaluation Method | Evaluation Definition |
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
|---|---|---|
| Evaluated on LLM spans| Evaluated using LLM | Tool Selection verifies that the tools chosen by the LLM align with the user's request and the available tools. The evaluation identifies cases where irrelevant or incorrect tool calls were made.|

@@ -339,9 +339,9 @@ def subtract_numbers(a: int, b: int) -> int:
Subtracts two numbers.
"""
return a - b
-

-# List of tools available to the agent
+
+# List of tools available to the agent
math_tutor_agent = Agent(
name="Math Tutor",
handoff_description="Specialist agent for math questions",
@@ -360,21 +360,21 @@ history_tutor_agent = Agent(
)

# The triage agent decides which specialized agent to hand off the task to — another type of tool selection covered by this evaluation.
-triage_agent = Agent(
+triage_agent = Agent(
'openai:gpt-4o',
model_settings=ModelSettings(temperature=0),
-instructions='What is the sum of 1 to 10?',
+instructions='What is the sum of 1 to 10?',
handoffs=[math_tutor_agent, history_tutor_agent],
)
{{< /code-block >}}

#### Tool argument correctness

-This check looks at the arguments provided to a selected tool, and it evaluates whether these arguments match the expected type and make sense given the tool's context.
+This check looks at the arguments provided to a selected tool, and it evaluates whether these arguments match the expected type and make sense given the tool's context.

{{< img src="llm_observability/evaluations/tool_argument_correctness_error.png" alt="A tool argument correctness error detected by the evaluation in LLM Observability" style="width:100%;" >}}

-| Evaluation Stage | Evaluation Method | Evaluation Definition |
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
|---|---|---|
| Evaluated on LLM spans| Evaluated using LLM | Tool Argument Correctness verifies that the arguments provided to a tool by the LLM are correct and contextually relevant. This evaluation identifies cases where the arguments provided to the tool are incorrect according to the tool schema (for example: the argument is expected to be an integer rather than a string) and are not relevant (for example: the argument is a country, but the model provides the name of a city).|

@@ -403,7 +403,7 @@ def subtract_numbers(a: int, b: int) -> int:
"""
return a - b

-
+
def multiply_numbers(a: int, b: int) -> int:
"""
Multiplies two numbers.
@@ -441,7 +441,7 @@ history_tutor_agent = Agent(
)

# Create the triage agent
-# Note: pydantic_ai handles handoffs differently - you'd typically use result_type
+# Note: pydantic_ai handles handoffs differently - you'd typically use result_type
# or custom logic to route between agents
triage_agent = Agent(
'openai:gpt-5-nano',
@@ -470,27 +470,27 @@ result = triage_agent.run_sync(
This check evaluates each input prompt from the user and the response from the LLM application for toxic content. This check identifies and flags toxic content to ensure that interactions remain respectful and safe.

{{< img src="llm_observability/evaluations/toxicity_1.png" alt="A Toxicity evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}
-
-| Evaluation Stage | Evaluation Method | Evaluation Definition |
+
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
|---|---|---|
| Evaluated on Input and Output | Evaluated using LLM | Toxicity flags any language or behavior that is harmful, offensive, or inappropriate, including but not limited to hate speech, harassment, threats, and other forms of harmful communication. |

##### Toxicity configuration

<div class="alert alert-info">Configuring toxicity evaluation categories is supported if OpenAI or Azure OpenAI is selected as your LLM provider.</div>
-You can configure toxicity evaluations to use specific categories of toxicity, listed in the following table.
+You can configure toxicity evaluations to use specific categories of toxicity, listed in the following table.

-| Category | Description |
+| Category | Description |
|---|---|
-| Discriminatory Content | Content that discriminates against a particular group, including based on race, gender, sexual orientation, culture, etc. |
-| Harassment | Content that expresses, incites, or promotes negative or intrusive behavior toward an individual or group. |
-| Hate | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. |
-| Illicit | Content that asks, gives advice, or instruction on how to commit illicit acts. |
-| Self Harm | Content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders. |
-| Sexual | Content that describes or alludes to sexual activity. |
-| Violence | Content that discusses death, violence, or physical injury. |
-| Profanity | Content containing profanity. |
-| User Dissatisfaction | Content containing criticism towards the model. *This category is only available for evaluating input toxicity.* |
+| Discriminatory Content | Content that discriminates against a particular group, including based on race, gender, sexual orientation, culture, etc. |
+| Harassment | Content that expresses, incites, or promotes negative or intrusive behavior toward an individual or group. |
+| Hate | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. |
+| Illicit | Content that asks, gives advice, or instruction on how to commit illicit acts. |
+| Self Harm | Content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders. |
+| Sexual | Content that describes or alludes to sexual activity. |
+| Violence | Content that discusses death, violence, or physical injury. |
+| Profanity | Content containing profanity. |
+| User Dissatisfaction | Content containing criticism towards the model. *This category is only available for evaluating input toxicity.* |

The toxicity categories in this table are informed by: [Banko et al. (2020)][14], [Inan et al. (2023)][15], [Ghosh et al. (2024)][16], [Zheng et al. (2024)][17].

@@ -500,13 +500,13 @@ This check identifies attempts by unauthorized or malicious authors to manipulat

{{< img src="llm_observability/evaluations/prompt_injection_1.png" alt="A Prompt Injection evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}

-| Evaluation Stage | Evaluation Method | Evaluation Definition |
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
|---|---|---|
| Evaluated on Input | Evaluated using LLM | [Prompt Injection][13] flags any unauthorized or malicious insertion of prompts or cues into the conversation by an external party or user. |

##### Prompt injection configuration
<div class="alert alert-info">Configuring prompt injection evaluation categories is supported if OpenAI or Azure OpenAI is selected as your LLM provider.</div>
-You can configure the prompt injection evaluation to use specific categories of prompt injection, listed in the following table.
+You can configure the prompt injection evaluation to use specific categories of prompt injection, listed in the following table.

| Configuration Option | Description | Example(s) |
|---|---|---|
@@ -520,8 +520,8 @@ You can configure the prompt injection evaluation to use specific categories of
This check ensures that sensitive information is handled appropriately and securely, reducing the risk of data breaches or unauthorized access.

{{< img src="llm_observability/evaluations/sensitive_data_scanning_1.png" alt="A Security and Safety evaluation detected by the Sensitive Data Scanner in LLM Observability" style="width:100%;" >}}
-
-| Evaluation Stage | Evaluation Method | Evaluation Definition |
+
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
|---|---|---|
| Evaluated on Input and Output | Sensitive Data Scanner | Powered by the [Sensitive Data Scanner][4], LLM Observability scans, identifies, and redacts sensitive information within every LLM application's prompt-response pairs. This includes personal information, financial data, health records, or any other data that requires protection due to privacy or security concerns. |
