
Blog for automatic evaluation #484

Open

dbczumar wants to merge 9 commits into mlflow:main from dbczumar:blog_online

Conversation

@dbczumar
Contributor

No description provided.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
@github-actions

github-actions bot commented Feb 24, 2026

🚀 Netlify Preview Deployed!

Preview URL: https://pr-484--test-mlflow-website.netlify.app

Details

PR: #484
Build Action: https://github.com/mlflow/mlflow-website/actions/runs/22333723767
Deploy Action: https://github.com/mlflow/mlflow-website/actions/runs/22333759816

This preview will be updated automatically on new commits.

@dbczumar dbczumar requested a review from B-Step62 February 24, 2026 02:12
@B-Step62 (Collaborator) left a comment:

LGTM once the video is replaced with the correct one!


```jsx
<video width="100%" controls autoPlay loop muted>
  <source
    src={require("./automatic-evaluation-ui-setup.mp4").default}
```
A collaborator commented:

It seems this video is supposed to be a different one.

image: img/blog/mlflow-automatic-evaluation-thumbnail.png
---

We're excited to introduce **Automatic Evaluation**, a new capability in MLflow that runs LLM judges on your agent traces and conversations as they're logged, with no code required.
A collaborator commented:

nit: can we replace em dashes? :)


Automatic evaluation runs [**LLM judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/#llms-as-judges) on your traces and conversations as they're logged to MLflow. You configure judges once, and they run automatically on new traces as they arrive. Judges run asynchronously on the MLflow server, so your application's performance is unaffected.
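The configure-once flow described above can be sketched in plain Python. Everything here (the judge registry, the trace dict shape, and the function names) is an illustrative stand-in, not MLflow's actual API; in MLflow, registered judges run asynchronously on the tracking server rather than in your application process:

```python
# Hypothetical sketch of "configure judges once, run them on each new trace".
# Names and the trace format are illustrative, not MLflow's API.
registered_judges = {}

def register_judge(name, fn):
    """Configure a judge once; it then applies to every future trace."""
    registered_judges[name] = fn

def on_trace_logged(trace):
    """Invoked server-side for each newly logged trace, so the
    application that produced the trace is never blocked."""
    return {name: fn(trace) for name, fn in registered_judges.items()}

# A trivial "judge": does the output contain an answer at all?
register_judge("contains_answer", lambda t: "answer" in t["output"].lower())

scores = on_trace_logged({"input": "q", "output": "The answer is 42."})
print(scores)  # {'contains_answer': True}
```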

An LLM judge pairs a language model with a tailored prompt to evaluate the outputs of an agent or LLM application against specific criteria: safety, groundedness, tool usage, user satisfaction, and more. MLflow provides a large ecosystem of [**built-in judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/) for common evaluation criteria, plus integrations with [DeepEval](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#deepeval), [RAGAS](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#ragas), and [Phoenix Evals](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#arize-phoenix-evals). The [`make_judge()`](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/) API also enables you to create custom judges from natural language criteria.
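To make the judge concept concrete, here is a minimal self-contained sketch of a "model plus tailored prompt" pair. The `Verdict` class, `make_stub_judge`, and the stub model below are hypothetical illustrations of the idea only; they are not MLflow's `make_judge()` API, which delegates the evaluation to a real LLM:

```python
# Hypothetical sketch of an LLM judge: a criterion-specific prompt wrapped
# around a model, returning a structured verdict. The "model" is a stub.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    criterion: str
    passed: bool
    rationale: str

def make_stub_judge(criterion: str, model: Callable[[str], str]):
    # The tailored prompt: fixes the criterion, leaves a slot for the output.
    prompt_template = (
        f"Evaluate the following agent output for {criterion}. "
        "Answer 'yes' or 'no', then a brief rationale.\n\nOutput: {output}"
    )

    def judge(output: str) -> Verdict:
        reply = model(prompt_template.format(output=output))
        answer, _, rationale = reply.partition(":")
        return Verdict(criterion, answer.strip().lower() == "yes", rationale.strip())

    return judge

# Stub "model" that flags outputs containing an absolute claim.
stub_model = lambda prompt: (
    "no: contains 'always'" if "always" in prompt else "yes: looks fine"
)

groundedness_judge = make_stub_judge("groundedness", stub_model)
print(groundedness_judge("The API always succeeds.").passed)  # False
print(groundedness_judge("The API returned 200.").passed)     # True
```

A real judge would swap the stub for an LLM call and a more careful output parser; the shape (criterion, prompt template, structured verdict) stays the same.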
A collaborator suggested a change:

Suggested change
- An LLM judge is a language model that evaluates the outputs of an agent or LLM application against specific criteria—safety, groundedness, tool usage, user satisfaction, and more. MLflow provides a large ecosystem of [**built-in judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/) for common evaluation criteria, plus integrations with [DeepEval](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#deepeval), [RAGAS](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#ragas), and [Phoenix Evals](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#arize-phoenix-evals). The [`make_judge()`](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/) API also enables you to create custom judges from natural language criteria.
+ An LLM judge is a pair of a language model and a tailored prompt that evaluates the outputs of an agent or LLM application against specific criteria—safety, groundedness, tool usage, user satisfaction, and more. MLflow provides a large ecosystem of [**built-in judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/) for common evaluation criteria, plus integrations with [DeepEval](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#deepeval), [RAGAS](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#ragas), and [Phoenix Evals](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#arize-phoenix-evals). The [`make_judge()`](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/) API also enables you to create custom judges from natural language criteria.
