---
title: Custom LLM-as-a-Judge Evaluations
description: How to create custom LLM-as-a-judge evaluations, and how to use these evaluation results across LLM Observability.
further_reading:
- link: "/llm_observability/terms/"
  tag: "Documentation"
  text: "Learn about LLM Observability terms and concepts"
- link: "/llm_observability/setup"
  tag: "Documentation"
  text: "Learn how to set up LLM Observability"
- link: "/llm_observability/evaluations/managed_evaluations"
  tag: "Documentation"
  text: "Learn about managed evaluations"
- link: "https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/"
  tag: "Blog"
  text: "Building an LLM evaluation framework: best practices"
- link: "https://huggingface.co/learn/cookbook/llm_judge"
  tag: "Hugging Face"
  text: "Using LLM-as-a-judge for an automated and versatile evaluation"
---

Custom LLM-as-a-judge evaluations use an LLM to judge the performance of another LLM. You can define evaluation logic with natural language prompts, capture subjective or objective criteria (like tone, helpfulness, or factuality), and run these evaluations at scale across your traces and spans.
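Conceptually, a custom LLM-as-a-judge evaluation sends a span's input and/or output to a judge model along with grading instructions, and records the judge's verdict as an evaluation result. The sketch below illustrates the pattern only; it is not how Datadog executes these evaluations, and it assumes the `openai` Python client and an illustrative judge prompt.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative grading instructions; in Datadog, this corresponds to the
# System Prompt you configure for the evaluation.
JUDGE_SYSTEM_PROMPT = (
    "You are evaluating an AI assistant's answer. "
    "Reply with exactly one word: 'pass' if the answer addresses the "
    "user's question, or 'fail' if it does not."
)

def judge(span_input: str, span_output: str) -> str:
    """Ask a judge model to grade one span's input/output pair."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"Span Input: {span_input}\nSpan Output: {span_output}"},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(judge(
    "What are the core things I should know about budgeting?",
    "Track your income, set category limits, and review your spending monthly.",
))
```

When you create an evaluation in Datadog, you define the same pieces (judge model, grading prompt, and output format) in the UI, and Datadog runs the judge against your traces for you.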

## Create a custom LLM-as-a-judge evaluation

You can create and manage custom evaluations from the [Evaluations page][1] in LLM Observability.

1. In Datadog, navigate to the LLM Observability [Evaluations page][1]. Select **Create Evaluation**, then select **Create your own**.
   {{< img src="llm_observability/evaluations/custom_llm_judge_1.png" alt="The LLM Observability Evaluations page with the Create Evaluation side panel opened. The first item, 'Create your own,' is selected." style="width:100%;" >}}

1. Provide a clear, descriptive **evaluation name** (for example, `factuality-check` or `tone-eval`). You will use this name when querying evaluation results. The name must be unique within your application.

1. Use the **Account** drop-down menu to select the LLM provider and corresponding account to use for your LLM judge. To connect a new account, see [connect an LLM provider][2].

1. Use the **Model** drop-down menu to select a model to use for your LLM judge.

1. Under the **Evaluation Prompt** section, use the **Prompt Template** drop-down menu to choose a starting point:
   - **Create from scratch**: Use your own custom prompt (defined in the next step).
   - **Failure to Answer**, **Prompt Injection**, **Sentiment**, and other templates: Populate a pre-existing prompt template. You can use these templates as-is, or modify them to match your specific evaluation logic.

1. In the **System Prompt** field, enter your custom prompt or modify a prompt template.
   For custom prompts, provide clear instructions describing what the evaluator should assess:

   - Focus on a single evaluation goal.
   - Include 2–3 few-shot examples showing input/output pairs, expected results, and reasoning.

{{% collapse-content title="Example custom prompt" level="h4" expanded=false id="custom-prompt-example" %}}
**System Prompt**
```
You will be looking at interactions between a user and a budgeting AI agent. Your job is to classify the user's intent when it comes to using the budgeting AI agent.

You will be given a Span Input, which represents the user's message to the agent, which you will then classify. Here are some examples.

Span Input: What are the core things I should know about budgeting?
Classification: general_financial_advice

Span Input: Did I go over budget with my grocery bills last month?
Classification: budgeting_question

Span Input: What is the category for which I have the highest budget?
Classification: budgeting_question

Span Input: Based on my past months, what is my ideal budget for subscriptions?
Classification: budgeting_advice

Span Input: Raise my restaurant budget by $50
Classification: budgeting_request

Span Input: Help me plan a trip to the Maldives
Classification: unrelated
```

**User**

```
Span Input: {{span_input}}
```
{{% /collapse-content %}}

7. In the **User** field, provide your user prompt. Explicitly specify which parts of the span to evaluate: the Span Input (`{{span_input}}`), the Span Output (`{{span_output}}`), or both.
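   For example, a user prompt that evaluates both the input and the output of an interaction might look like the following (the surrounding wording is illustrative; `{{span_input}}` and `{{span_output}}` are the template variables described above):

   ```
   Here is the interaction to evaluate.

   Span Input: {{span_input}}
   Span Output: {{span_output}}
   ```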

7. Select an evaluation output type:

   - **Boolean**: True/false results (for example, "Did the model follow instructions?")
   - **Score**: Numeric ratings (for example, a 1–5 scale for helpfulness)
   - **Categorical**: Discrete labels (for example, "Good", "Bad", "Neutral")
   <div class="alert alert-info">For Anthropic and Amazon Bedrock models, only the <strong>Boolean</strong> output type is available.</div>

7. Define the structure of your output.

   {{< tabs >}}
   {{% tab "OpenAI" %}}
   {{% llm-eval-output-json %}}
   {{% /tab %}}

   {{% tab "Azure OpenAI" %}}
   {{% llm-eval-output-json %}}
   {{% /tab %}}

   {{% tab "Anthropic" %}}
   {{% llm-eval-output-keyword %}}
   {{% /tab %}}

   {{% tab "Amazon Bedrock" %}}
   {{% llm-eval-output-keyword %}}
   {{% /tab %}}
   {{< /tabs >}}

7. Under **Evaluation Scope**, define the scope of your evaluation:
   - **Application**: Select the application you want to evaluate.
   - **Evaluate On**: Choose one of the following:
     - **Traces**: Evaluate only root spans.
     - **All Spans**: Evaluate both root and child spans.
   - **Span Names**: (Optional) Limit evaluation to spans with certain names.
   - **Tags**: (Optional) Limit evaluation to spans with certain tags.
   - **Sampling Rate**: (Optional) Apply sampling (for example, 10%) to control evaluation cost.

7. Use the **Test Evaluation** panel on the right to preview how your evaluator performs. You can input sample `{{span_input}}` and `{{span_output}}` values, then click **Run Evaluation** to see the LLM-as-a-judge's output before saving. Modify your evaluation until you are satisfied with the results.
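   For example, with the intent-classification prompt shown earlier, you might provide the following sample value (illustrative) and confirm that the evaluator returns the expected category:

   ```
   {{span_input}}: How much of my grocery budget is left for this month?
   Expected result: budgeting_question
   ```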

{{< img src="llm_observability/evaluations/custom_llm_judge_2.png" alt="Creation flow for a custom LLM-as-a-judge evaluation. On the right, under Test Evaluation, sample span_input and span_output have been provided. An Evaluation Result textbox below displays a sample result." style="width:100%;" >}}

## View and use evaluation results

After you save your evaluation, Datadog automatically runs it on the targeted spans. Results are available across LLM Observability in near real time. You can find your custom LLM-as-a-judge results for a specific span in the **Evaluations** tab, next to all other evaluations.

{{< img src="llm_observability/evaluations/custom_llm_judge_3.png" alt="The Evaluations tab of a trace, displaying custom evaluation results alongside managed evaluations." style="width:100%;" >}}

Use the syntax `@evaluations.custom.<evaluation_name>` to query or visualize evaluation results.

For example:
```
@evaluations.custom.helpfulness-check
```
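
To filter for spans with a specific evaluation result, append the value to the query. For example, to find spans that the intent classifier from the example above labeled as budgeting questions:

```
@evaluations.custom.budget-guru-intent-classifier:budgeting_question
```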

{{< img src="llm_observability/evaluations/custom_llm_judge_4.png" alt="The LLM Observability Traces view. In the search box, the user has entered `@evaluations.custom.budget-guru-intent-classifier:budgeting_question` and results are populated below." style="width:100%;" >}}

You can:
- Filter traces by evaluation results
- Use evaluation results as [facets][3]
- View aggregate results in the Evaluations section of the LLM Observability Overview page
- Create [monitors][4] to alert on performance changes or regressions

## Best practices for reliable custom evaluations

- **Start small**: Target a single, well-defined failure mode before scaling.
- **Iterate**: Run the evaluation, inspect the outputs, and refine your prompt.
- **Validate**: Periodically check evaluator accuracy against human-labeled samples of your traces, as in the sketch after this list.
- **Document your rubric**: Clearly define what "Pass" and "Fail" mean to avoid drift over time.
- **Re-align your evaluator**: Reassess your prompt and few-shot examples when the underlying judge model changes.
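
For the validation step, a lightweight spot check often goes a long way: sample some evaluated spans, label them by hand, and measure agreement with the evaluator. The sketch below assumes you have already exported evaluator results and collected human labels for the same spans; the span IDs and labels are hypothetical.

```python
# Hypothetical spot check: compare evaluator output with human labels
# for a small sample of spans. Replace the sample data with labels
# exported from your own evaluated spans.
evaluator_labels = {
    "span-1": "budgeting_question",
    "span-2": "unrelated",
    "span-3": "budgeting_request",
    "span-4": "budgeting_advice",
}
human_labels = {
    "span-1": "budgeting_question",
    "span-2": "general_financial_advice",  # human disagrees with the evaluator
    "span-3": "budgeting_request",
    "span-4": "budgeting_advice",
}

# Agreement rate over the spans that were labeled by a human.
matches = sum(
    1 for span_id, label in human_labels.items()
    if evaluator_labels.get(span_id) == label
)
accuracy = matches / len(human_labels)
print(f"Evaluator agreement with human labels: {accuracy:.0%}")  # 75%
```

If agreement drops over time, revisit your rubric and few-shot examples before trusting the evaluation results in monitors or dashboards.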

## Further Reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: https://app.datadoghq.com/llm/settings/evaluations
[2]: /llm_observability/evaluations/managed_evaluations#connect-your-llm-provider-account
[3]: /service_management/events/explorer/facets/
[4]: /monitors/