
Conversation

@mtullalizardi (Contributor) commented Oct 6, 2025

What does this PR do? What is the motivation?

This PR adds documentation for the LLM Observability "Custom LLM-as-a-Judge Evaluations" feature. The feature is being GA'd and is already deployed; we are waiting for the documentation to go live before turning on the feature flag.

Merge instructions

Merge readiness:

  • Ready for merge

For Datadog employees:

Your branch name MUST follow the <name>/<description> convention and include the forward slash (/). Without this format, your pull request will not pass CI, the GitLab pipeline will not run, and you won't get a branch preview. Getting a branch preview makes it easier for us to check any issues with your PR, such as broken links.

If your branch doesn't follow this format, rename it or create a new branch and PR.

[6/5/2025] Merge queue has been disabled on the documentation repo. If you have write access to the repo, the PR has been reviewed by a Documentation team member, and all of the required checks have passed, you can use the Squash and Merge button to merge the PR. If you don't have write access, or you need help, reach out in the #documentation channel in Slack.

Additional notes

@mtullalizardi force-pushed the miguel.tullalizardi/byop-documentation branch from a0d0a2b to 5036d05 on October 6, 2025 20:25
@github-actions bot commented Oct 6, 2025

@github-actions bot added the "Images: Images are added/removed with this PR" label Oct 6, 2025
@github-actions bot added the "Architecture: Everything related to the Doc backend" label Oct 6, 2025
@mtullalizardi marked this pull request as ready for review October 6, 2025 21:46
@mtullalizardi requested a review from a team as a code owner October 6, 2025 21:46
@hestonhoffman added the "editorial review: Waiting on a more in-depth review" label Oct 6, 2025
@hestonhoffman (Contributor):

Hi! Added a ticket for this one here.

parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor:

you don't need to change these, just the english one

Contributor Author:

done

parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor:

or this

Contributor Author:

done

parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor:

or this

Contributor Author:

done

parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor:

or this

Contributor Author:

done


### Custom LLM-as-a-Judge Evaluations

[Custom LLM-as-a-Judge Evaluations][1] allow you to define your own evaluation logic using natural language prompts. You can create custom evaluations to assess subjective or objective criteria—like tone, helpfulness, or factuality—and run them at scale across your traces and spans.
Contributor:

spaces around dashes criteria - like tone, helpfulness, or factuality - and

Contributor Author:

done


## Overview

Custom LLM-as-a-Judge Evaluations let you define your own evaluation logic to automatically assess your LLM applications. You can use natural language prompts to capture subjective or objective criteria—like tone, helpfulness, or factuality—and run them at scale across your traces and spans.
Contributor:

criteria - like tone, helpfulness, or factuality - and

Contributor Author:

done

You define:
- The criteria (via prompt text)
- What is evaluated (e.g., a span's output)
- The model (e.g., GPT-4o)
Contributor:

nit: GPT-4o - format as code using backticks

Contributor Author:

done

- The criteria (via prompt text)
- What is evaluated (e.g., a span's output)
- The model (e.g., GPT-4o)
- The output type (boolean, numeric score, or categorical label)
Contributor:

same format boolean, score and categorical as code since they refer to code concepts in our app

Contributor Author:

done

- The model (e.g., GPT-4o)
- The output type (boolean, numeric score, or categorical label)

Datadog then runs this evaluation logic automatically against your spans, recording results as structured metrics that you can query, visualize, and monitor.
Contributor:

maybe structured evaluation metrics. "metrics" means this metric at datadog, we brand these things as evaluations on the UI or in the query language

Contributor Author:

Changed the wording to just "...recording results for you to query, visualize, and monitor."
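As background for the wording discussion above: the doc text under review describes Datadog filling a judge prompt with a span's data, calling the chosen model, and recording the result. A rough mental model of that flow can be sketched as below. This is an illustrative sketch only, not Datadog's implementation or API; `JudgeEvaluation`, `call_model`, and `run_evaluation` are all hypothetical names introduced here.

```python
# Illustrative sketch only -- not Datadog's implementation. It models the
# LLM-as-a-judge flow the documentation describes: criteria (prompt text),
# a judge model, an output type, run against a span's output.
from dataclasses import dataclass


@dataclass
class JudgeEvaluation:
    name: str          # e.g. "helpfulness"
    prompt: str        # natural-language criteria with an {output} placeholder
    model: str         # e.g. "gpt-4o"
    output_type: str   # "boolean", "score", or "categorical"


def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    return "true"  # canned response so the sketch is runnable


def run_evaluation(evaluation: JudgeEvaluation, span_output: str) -> dict:
    """Fill the judge prompt with the span's output, call the judge model,
    and return the raw verdict alongside the evaluation name."""
    prompt = evaluation.prompt.format(output=span_output)
    verdict = call_model(evaluation.model, prompt)
    return {"evaluation": evaluation.name, "verdict": verdict}


result = run_evaluation(
    JudgeEvaluation(
        name="helpfulness",
        prompt="Is the following response helpful? Answer true or false.\n{output}",
        model="gpt-4o",
        output_type="boolean",
    ),
    span_output="Restart the agent with `datadog-agent restart`.",
)
```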

- **Score** – Numeric rating (e.g., 1–5 scale for helpfulness)
- **Categorical** – Discrete labels (e.g., "Good", "Bad", "Neutral")

The schema ensures your results are structured for querying and dashboarding. For Anthropic and Bedrock models, only Boolean output types are allowed.
Contributor:

we should mention

  1. the fact that structured output is OpenAI's structured output and it needs to be edited by the user
  2. the keyword search logic for Anthropic

Contributor Author:

Added
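The two extraction strategies the reviewer asks the doc to cover can be contrasted in a minimal sketch: structured JSON output (as with OpenAI's structured outputs) versus keyword search over free text (as described for Anthropic models, which support only Boolean results). This is illustrative only, not Datadog's implementation; `parse_structured` and `parse_keyword` are hypothetical names, and the `"verdict"` JSON key is an assumption.

```python
# Illustrative sketch only -- not Datadog's implementation. Contrasts the two
# verdict-extraction strategies discussed in this review thread.
import json
import re


def parse_structured(raw: str):
    """Parse a judge response constrained to a JSON schema (structured output)."""
    return json.loads(raw)["verdict"]


def parse_keyword(raw: str, true_keywords, false_keywords):
    """Scan free-text judge output for configured true/false keywords."""
    words = set(re.findall(r"[a-z]+", raw.lower()))
    if words & {k.lower() for k in true_keywords}:
        return True
    if words & {k.lower() for k in false_keywords}:
        return False
    return None  # inconclusive: no configured keyword appeared
```

The keyword path is what makes a Boolean output type workable without schema enforcement, which is consistent with the doc's note that Anthropic and Bedrock models only allow Boolean output types.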

@cswatt (Contributor) commented Oct 7, 2025

I pushed a few edits:

  • Removed redundancies from the overview paragraph (e.g. what the user defines is evident from the creation steps, and I don't think there's really a need to call that out at the top)
  • Added explicit imperative instructions for what the user should do on the page (e.g. "use the drop-down menu")
  • Ensured that all names of UI elements are strictly aligned with actual UI text (e.g. changed "True Keywords" to "True keywords")
  • For the output structure step, added tabs for each model. The user can now simply select their model and receive imperative instructions.
  • Edited alt text so that it describes the image rather than gives the user an instruction
  • Other smaller style edits
