diff --git a/config/_default/menus/main.en.yaml b/config/_default/menus/main.en.yaml
index 25ae18f61dd17..278699b8eb2b9 100644
--- a/config/_default/menus/main.en.yaml
+++ b/config/_default/menus/main.en.yaml
@@ -344,7 +344,7 @@ menu:
       url: agent/supported_platforms/heroku/
       weight: 405
       parent: agent_supported_platforms
-    - name: MacOS 
+    - name: MacOS
       identifier: basic_agent_usage_osx
       url: agent/supported_platforms/osx/
       weight: 406
@@ -4780,6 +4780,11 @@ menu:
       parent: llm_obs
       identifier: llm_obs_evaluations
       weight: 4
+    - name: Custom LLM-as-a-Judge
+      url: llm_observability/evaluations/custom_llm_as_a_judge_evaluations
+      parent: llm_obs_evaluations
+      identifier: llm_obs_custom_llm_as_a_judge_evaluations
+      weight: 400
     - name: Managed
       url: llm_observability/evaluations/managed_evaluations
       parent: llm_obs_evaluations
diff --git a/content/en/llm_observability/evaluations/_index.md b/content/en/llm_observability/evaluations/_index.md
index e807ed0a534e7..0e333931bc133 100644
--- a/content/en/llm_observability/evaluations/_index.md
+++ b/content/en/llm_observability/evaluations/_index.md
@@ -8,34 +8,39 @@ aliases:
 
 ## Overview
 
-LLM Observability offers several ways to support evaluations. They can be configured by navigating to [**AI Observability > Settings > Evaluations**][7].
+LLM Observability offers several ways to support evaluations. They can be configured by navigating to [**AI Observability > Settings > Evaluations**][8].
 
-### Managed Evaluations
+### Custom LLM-as-a-judge evaluations
 
-Datadog builds and supports [Managed Evaluations][1] to support common use cases. You can enable and configure them within the LLM Observability application.
+[Custom LLM-as-a-judge evaluations][1] allow you to define your own evaluation logic using natural language prompts. You can create custom evaluations to assess subjective or objective criteria (like tone, helpfulness, or factuality) and run them at scale across your traces and spans.
 
-### Submit External Evaluations
+### Managed evaluations
 
-You can also submit [External Evaluations][2] using Datadog's API. This mechanism is great if you have your own evaluation system, but would like to centralize that information within Datadog.
+Datadog builds and supports [managed evaluations][2] for common use cases. You can enable and configure them within the LLM Observability application.
 
-### Evaluation Integrations
+### Submit external evaluations
 
-Datadog also supports integrations with some 3rd party evaluation frameworks, such as [Ragas][3] and [NeMo][4].
+You can also submit [external evaluations][3] using Datadog's API. This is useful if you have your own evaluation system but want to centralize that information in Datadog.
+
+### Evaluation integrations
+
+Datadog also supports integrations with third-party evaluation frameworks, such as [Ragas][4] and [NeMo][5].
 
 ### Sensitive Data Scanner integration
 
-In addition to evaluating the input and output of LLM requests, agents, workflows, or the application, LLM Observability integrates with [Sensitive Data Scanner][5], which helps prevent data leakage by identifying and redacting any sensitive information (such as personal data, financial details, or proprietary information) that may be present in any step of your LLM application.
+In addition to evaluating the input and output of LLM requests, agents, workflows, or the application, LLM Observability integrates with [Sensitive Data Scanner][6], which helps prevent data leakage by identifying and redacting any sensitive information (such as personal data, financial details, or proprietary information) that may be present in any step of your LLM application.
 
 By proactively scanning for sensitive data, LLM Observability ensures that conversations remain secure and compliant with data protection regulations. This additional layer of security reinforces Datadog's commitment to maintaining the confidentiality and integrity of user interactions with LLMs.
 
 ### Permissions
 
-[LLM Observability Write permissions][6] are necessary to configure evaluations.
+[`LLM Observability Write` permissions][7] are necessary to configure evaluations.
 
-[1]: /llm_observability/evaluations/managed_evaluations
-[2]: /llm_observability/evaluations/external_evaluations
-[3]: /llm_observability/evaluations/ragas_evaluations
-[4]: /llm_observability/evaluations/submit_nemo_evaluations
-[5]: /security/sensitive_data_scanner/
-[6]: /account_management/rbac/permissions/#llm-observability
-[7]: https://app.datadoghq.com/llm/settings/evaluations
\ No newline at end of file
+[1]: /llm_observability/evaluations/custom_llm_as_a_judge_evaluations
+[2]: /llm_observability/evaluations/managed_evaluations
+[3]: /llm_observability/evaluations/external_evaluations
+[4]: /llm_observability/evaluations/ragas_evaluations
+[5]: /llm_observability/evaluations/submit_nemo_evaluations
+[6]: /security/sensitive_data_scanner/
+[7]: /account_management/rbac/permissions/#llm-observability
+[8]: https://app.datadoghq.com/llm/settings/evaluations
diff --git a/content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations.md b/content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations.md
new file mode 100644
index 0000000000000..aabe36eb7e2bf
--- /dev/null
+++ b/content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations.md
@@ -0,0 +1,161 @@
+---
+title: Custom LLM-as-a-Judge Evaluations
+description: How to create custom LLM-as-a-judge evaluations, and how to use these evaluation results across LLM Observability.
+further_reading:
+- link: "/llm_observability/terms/"
+  tag: "Documentation"
+  text: "Learn about LLM Observability terms and concepts"
+- link: "/llm_observability/setup"
+  tag: "Documentation"
+  text: "Learn how to set up LLM Observability"
+- link: "/llm_observability/evaluations/managed_evaluations"
+  tag: "Documentation"
+  text: "Learn about managed evaluations"
+- link: "https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/"
+  tag: "Blog"
+  text: "Building an LLM evaluation framework: best practices"
+- link: "https://huggingface.co/learn/cookbook/llm_judge"
+  tag: "Hugging Face"
+  text: "Using LLM-as-a-judge for an automated and versatile evaluation"
+---
+
+Custom LLM-as-a-judge evaluations use an LLM to judge the performance of another LLM. You can define evaluation logic with natural language prompts, capture subjective or objective criteria (like tone, helpfulness, or factuality), and run these evaluations at scale across your traces and spans.
+
+## Create a custom LLM-as-a-judge evaluation
+
+You can create and manage custom evaluations from the [Evaluations page][1] in LLM Observability.
+
+1. In Datadog, navigate to the LLM Observability [Evaluations page][1]. Select **Create Evaluation**, then select **Create your own**.
+   {{< img src="llm_observability/evaluations/custom_llm_judge_1.png" alt="The LLM Observability Evaluations page with the Create Evaluation side panel opened. The first item, 'Create your own,' is selected." style="width:100%;" >}}
+
+1. Provide a clear, descriptive **evaluation name** (for example, `factuality-check` or `tone-eval`). You will use this name when querying evaluation results. The name must be unique within your application.
+
+1. Use the **Account** drop-down menu to select the LLM provider and corresponding account to use for your LLM judge. To connect a new account, see [connect an LLM provider][2].
+
+1. Use the **Model** drop-down menu to select the model for your LLM judge.
+
+1. Under the **Evaluation Prompt** section, use the **Prompt Template** drop-down menu:
+   - **Create from scratch**: Use your own custom prompt (defined in the next step).
+   - **Failure to Answer**, **Prompt Injection**, **Sentiment**, and others: Start from a pre-existing prompt template. You can use these templates as-is, or modify them to match your specific evaluation logic.
+
+1. In the **System Prompt** field, enter your custom prompt or modify a prompt template.
+   For custom prompts, provide clear instructions describing what the evaluator should assess.
+
+   - Focus on a single evaluation goal.
+   - Include 2–3 few-shot examples showing input/output pairs, expected results, and reasoning.
+
+{{% collapse-content title="Example custom prompt" level="h4" expanded=false id="custom-prompt-example" %}}
+**System Prompt**
+```
+You will be looking at interactions between a user and a budgeting AI agent. Your job is to classify the user's intent when it comes to using the budgeting AI agent.
+
+You will be given a Span Input, which represents the user's message to the agent, which you will then classify. Here are some examples.
+
+Span Input: What are the core things I should know about budgeting?
+Classification: general_financial_advice
+
+Span Input: Did I go over budget with my grocery bills last month?
+Classification: budgeting_question
+
+Span Input: What is the category for which I have the highest budget?
+Classification: budgeting_question
+
+Span Input: Based on my past months, what is my ideal budget for subscriptions?
+Classification: budgeting_advice
+
+Span Input: Raise my restaurant budget by $50
+Classification: budgeting_request
+
+Span Input: Help me plan a trip to the Maldives
+Classification: unrelated
+```
+
+**User**
+
+```
+Span Input: {{span_input}}
+```
+{{% /collapse-content %}}
+
+7. In the **User** field, provide your user prompt. Explicitly specify which parts of the span to evaluate: the Span Input (`{{span_input}}`), the Output (`{{span_output}}`), or both.
+
+7. Select an evaluation output type:
+
+   - **Boolean**: True/false results (for example, "Did the model follow instructions?")
+   - **Score**: Numeric ratings (for example, a 1–5 scale for helpfulness)
+   - **Categorical**: Discrete labels (for example, "Good", "Bad", "Neutral")
+
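+   As a rough illustration (not an exact rendering of what the judge model receives), the budgeting classification prompt above pairs naturally with the **Categorical** output type: at evaluation time, `{{span_input}}` resolves to the selected span's input, and the judge returns one of the labels defined in the prompt.
+
+   ```
+   Rendered user prompt:       Span Input: Did I go over budget with my grocery bills last month?
+   Judge result (Categorical): budgeting_question
+   ```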