
Conversation

@mtullalizardi (Contributor) commented Oct 6, 2025

What does this PR do? What is the motivation?

This PR adds documentation for the LLM Observability "Custom LLM-as-a-Judge Evaluations" feature. The feature is being GA'd and is already deployed; we are waiting for the documentation to go live before turning on the feature flag.

Merge instructions

Merge readiness:

  • Ready for merge

For Datadog employees:

Your branch name MUST follow the <name>/<description> convention and include the forward slash (/). Without this format, your pull request will not pass CI, the GitLab pipeline will not run, and you won't get a branch preview. Getting a branch preview makes it easier for us to check any issues with your PR, such as broken links.

If your branch doesn't follow this format, rename it or create a new branch and PR.

[6/5/2025] Merge queue has been disabled on the documentation repo. If you have write access to the repo, the PR has been reviewed by a Documentation team member, and all of the required checks have passed, you can use the Squash and Merge button to merge the PR. If you don't have write access, or you need help, reach out in the #documentation channel in Slack.

Additional notes

@mtullalizardi force-pushed the miguel.tullalizardi/byop-documentation branch from a0d0a2b to 5036d05 on October 6, 2025 20:25
@github-actions bot commented Oct 6, 2025

@github-actions bot added the "Images: Images are added/removed with this PR" label Oct 6, 2025
@github-actions bot added the "Architecture: Everything related to the Doc backend" label Oct 6, 2025
@mtullalizardi marked this pull request as ready for review October 6, 2025 21:46
@mtullalizardi requested a review from a team as a code owner October 6, 2025 21:46
@hestonhoffman added the "editorial review: Waiting on a more in-depth review" label Oct 6, 2025
@hestonhoffman (Contributor):

Hi! Added a ticket for this one here.

parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor:

you don't need to change these, just the english one

Contributor Author:

done

parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor:

or this

Contributor Author:

done

parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor:

or this

Contributor Author:

done

parent: llm_obs
identifier: llm_obs_evaluations
weight: 4
- name: Custom LLM-as-a-Judge
Contributor:

or this

Contributor Author:

done


### Custom LLM-as-a-Judge Evaluations

[Custom LLM-as-a-Judge Evaluations][1] allow you to define your own evaluation logic using natural language prompts. You can create custom evaluations to assess subjective or objective criteria—like tone, helpfulness, or factuality—and run them at scale across your traces and spans.
Contributor:

spaces around dashes criteria - like tone, helpfulness, or factuality - and

Contributor Author:

done


## Overview

Custom LLM-as-a-Judge Evaluations let you define your own evaluation logic to automatically assess your LLM applications. You can use natural language prompts to capture subjective or objective criteria—like tone, helpfulness, or factuality—and run them at scale across your traces and spans.
Contributor:

criteria - like tone, helpfulness, or factuality - and

Contributor Author:

done

You define:
- The criteria (via prompt text)
- What is evaluated (e.g., a span's output)
- The model (e.g., GPT-4o)
Contributor:

nit: GPT-4o - format as code using backticks

Contributor Author:

done

- The criteria (via prompt text)
- What is evaluated (e.g., a span's output)
- The model (e.g., GPT-4o)
- The output type (boolean, numeric score, or categorical label)
Contributor:

same format boolean, score and categorical as code since they refer to code concepts in our app

Contributor Author:

done

- The model (e.g., GPT-4o)
- The output type (boolean, numeric score, or categorical label)

Datadog then runs this evaluation logic automatically against your spans, recording results as structured metrics that you can query, visualize, and monitor.
Contributor:

maybe structured evaluation metrics. "metrics" means this metric at datadog, we brand these things as evaluations on the UI or in the query language

Contributor Author:

Changed the wording to just "...recording results for you to query, visualize, and monitor."
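As background for the wording discussion above: the doc text under review describes Datadog filling a judge prompt with a span's data, calling the chosen model, and recording the result. A rough mental model of that flow can be sketched as below. This is an illustrative sketch only, not Datadog's implementation or API; `JudgeEvaluation`, `call_model`, and `run_evaluation` are all hypothetical names introduced here.

```python
# Illustrative sketch only -- not Datadog's implementation. It models the
# LLM-as-a-judge flow the documentation describes: criteria (prompt text),
# a judge model, an output type, run against a span's output.
from dataclasses import dataclass


@dataclass
class JudgeEvaluation:
    name: str          # e.g. "helpfulness"
    prompt: str        # natural-language criteria with an {output} placeholder
    model: str         # e.g. "gpt-4o"
    output_type: str   # "boolean", "score", or "categorical"


def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    return "true"  # canned response so the sketch is runnable


def run_evaluation(evaluation: JudgeEvaluation, span_output: str) -> dict:
    """Fill the judge prompt with the span's output, call the judge model,
    and return the raw verdict alongside the evaluation name."""
    prompt = evaluation.prompt.format(output=span_output)
    verdict = call_model(evaluation.model, prompt)
    return {"evaluation": evaluation.name, "verdict": verdict}


result = run_evaluation(
    JudgeEvaluation(
        name="helpfulness",
        prompt="Is the following response helpful? Answer true or false.\n{output}",
        model="gpt-4o",
        output_type="boolean",
    ),
    span_output="Restart the agent with `datadog-agent restart`.",
)
```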

- **Score** – Numeric rating (e.g., 1–5 scale for helpfulness)
- **Categorical** – Discrete labels (e.g., "Good", "Bad", "Neutral")

The schema ensures your results are structured for querying and dashboarding. For Anthropic and Bedrock models, only Boolean output types are allowed.
Contributor:

we should mention

  1. the fact that structured output is OpenAI's structured output and it needs to be edited by the user
  2. the keyword search logic for Anthropic

Contributor Author:

Added
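The two extraction strategies the reviewer asks the doc to cover can be contrasted in a minimal sketch: structured JSON output (as with OpenAI's structured outputs) versus keyword search over free text (as described for Anthropic models, which support only Boolean results). This is illustrative only, not Datadog's implementation; `parse_structured` and `parse_keyword` are hypothetical names, and the `"verdict"` JSON key is an assumption.

```python
# Illustrative sketch only -- not Datadog's implementation. Contrasts the two
# verdict-extraction strategies discussed in this review thread.
import json
import re


def parse_structured(raw: str):
    """Parse a judge response constrained to a JSON schema (structured output)."""
    return json.loads(raw)["verdict"]


def parse_keyword(raw: str, true_keywords, false_keywords):
    """Scan free-text judge output for configured true/false keywords."""
    words = set(re.findall(r"[a-z]+", raw.lower()))
    if words & {k.lower() for k in true_keywords}:
        return True
    if words & {k.lower() for k in false_keywords}:
        return False
    return None  # inconclusive: no configured keyword appeared
```

The keyword path is what makes a Boolean output type workable without schema enforcement, which is consistent with the doc's note that Anthropic and Bedrock models only allow Boolean output types.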

@cswatt (Contributor) commented Oct 7, 2025

I pushed a few edits:

  • Removed redundancies from the overview paragraph (e.g. what the user defines is evident from the creation steps, and I don't think there's really a need to call that out at the top)
  • Added explicit imperative instructions for what the user should do on the page (e.g. "use the drop-down menu")
  • Ensured that all names of UI elements are strictly aligned with actual UI text (e.g. changed "True Keywords" to "True keywords")
  • For the output structure step, added tabs for each model. The user can now simply select their model and receive imperative instructions.
  • Edited alt text so that it describes the image rather than gives the user an instruction
  • Other smaller style edits
