Documentation for Custom LLM-as-a-judge Evaluations #32018
base: master
Conversation
Force-pushed from a0d0a2b to 5036d05
Hi! Added a ticket for this one here.
config/_default/menus/main.es.yaml
Outdated
parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
You don't need to change these, just the English one.
done
config/_default/menus/main.fr.yaml
Outdated
parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
or this
done
config/_default/menus/main.ja.yaml
Outdated
parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
or this
done
config/_default/menus/main.ko.yaml
Outdated
parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
or this
done
### Custom LLM-as-a-Judge Evaluations

[Custom LLM-as-a-Judge Evaluations][1] allow you to define your own evaluation logic using natural language prompts. You can create custom evaluations to assess subjective or objective criteria—like tone, helpfulness, or factuality—and run them at scale across your traces and spans.
Spaces around dashes: `criteria - like tone, helpfulness, or factuality - and`
done
## Overview

Custom LLM-as-a-Judge Evaluations let you define your own evaluation logic to automatically assess your LLM applications. You can use natural language prompts to capture subjective or objective criteria—like tone, helpfulness, or factuality—and run them at scale across your traces and spans.
`criteria - like tone, helpfulness, or factuality - and`
done
You define:
- The criteria (via prompt text)
- What is evaluated (e.g., a span's output)
- The model (e.g., GPT-4o)
nit: `GPT-4o` - format as code using backticks
done
- The criteria (via prompt text)
- What is evaluated (e.g., a span's output)
- The model (e.g., GPT-4o)
- The output type (boolean, numeric score, or categorical label)
Same here - format `boolean`, `score`, and `categorical` as code, since they refer to code concepts in our app.
done
- The model (e.g., GPT-4o)
- The output type (boolean, numeric score, or categorical label)

Datadog then runs this evaluation logic automatically against your spans, recording results as structured metrics that you can query, visualize, and monitor.
Maybe "structured evaluation metrics"? "Metrics" means a specific product at Datadog; we brand these things as evaluations in the UI and in the query language.
Changed the wording to just "...recording results for you to query, visualize, and monitor."
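For intuition, the four components the user defines map onto a small config object. A minimal sketch in Python - the field names here are invented for illustration, since the actual definition is created through the Datadog UI, not through a structure like this:

```python
# Hypothetical sketch of a custom LLM-as-a-judge definition.
# All field names are illustrative; the real evaluation is configured
# in the Datadog UI rather than via a dict like this.
helpfulness_eval = {
    # 1. The criteria, expressed as a natural-language prompt
    "prompt": (
        "Rate how helpful the assistant's answer is to the user's "
        "question on a scale of 1-5."
    ),
    # 2. What is evaluated (here, each span's output)
    "target": "span.output",
    # 3. The judge model
    "model": "GPT-4o",
    # 4. The output type: boolean, score, or categorical
    "output_type": "score",
}
```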
- **Score** – Numeric rating (e.g., 1–5 scale for helpfulness)
- **Categorical** – Discrete labels (e.g., "Good", "Bad", "Neutral")

The schema ensures your results are structured for querying and dashboarding. For Anthropic and Bedrock models, only Boolean output types are allowed.
We should mention:
- the fact that the structured output is OpenAI's structured output, and it needs to be edited by the user
- the keyword search logic for Anthropic
Added
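For context on the first point, a minimal sketch of the kind of OpenAI structured-output request a Boolean judge could make. The prompt and schema contents are illustrative, not the exact ones Datadog generates:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative JSON Schema constraining the judge to a boolean verdict.
judge_schema = {
    "name": "judge_verdict",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {"verdict": {"type": "boolean"}},
        "required": ["verdict"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a judge. Decide whether the response answers the question.",
        },
        {"role": "user", "content": "Question: ...\nResponse: ..."},
    ],
    response_format={"type": "json_schema", "json_schema": judge_schema},
)
print(response.choices[0].message.content)  # e.g. {"verdict": true}
```

Anthropic models don't support this `response_format` parameter, which is presumably why the Anthropic and Bedrock paths rely on keyword matching over the model's free-text answer and only support Boolean outputs.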
I pushed a few edits:
What does this PR do? What is the motivation?
This PR adds documentation for the LLM Observability "Custom LLM-as-a-judge Evaluations" feature. This is a new feature we are GAing, and it is already deployed; we are waiting on getting the documentation up before turning on the feature flag.
Merge instructions
Merge readiness:
For Datadog employees:
Your branch name MUST follow the `<name>/<description>` convention and include the forward slash (`/`). Without this format, your pull request will not pass CI, the GitLab pipeline will not run, and you won't get a branch preview. Getting a branch preview makes it easier for us to check any issues with your PR, such as broken links. If your branch doesn't follow this format, rename it or create a new branch and PR.
[6/5/2025] Merge queue has been disabled on the documentation repo. If you have write access to the repo, the PR has been reviewed by a Documentation team member, and all of the required checks have passed, you can use the Squash and Merge button to merge the PR. If you don't have write access, or you need help, reach out in the #documentation channel in Slack.
Additional notes