Add blog post: Agent Trace Evaluation with TruLens Scorers in MLflow #482
debu-sinha wants to merge 7 commits into mlflow:main
Conversation
@smoorjani ready for review whenever you get a chance

@dmatrix would you be able to review this one too? It covers the TruLens scorer integration for trace-aware agent evaluation.

🚀 Netlify Preview Deployed! Preview URL: https://pr-482--test-mlflow-website.netlify.app (PR #482). This preview will be updated automatically on new commits.
Force-pushed e00d4c2 to c30da4e, then to 2db9778, then to 8cbcd29.
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Hey @sfc-gh-jreini, the MLflow team greenlit this blog post. Since it complements the Snowflake blog you published, would you be interested in co-authoring or reviewing it? Netlify preview is live: https://pr-482--test-mlflow-website.netlify.app
smoorjani left a comment:
Looks great! Left some comments to address. Let's also run this through a grammar checker (if you haven't already).
> {/* truncate */}
>
> The integration adds 10 scorers that bring trace-aware agent evaluation to the scorer framework for the first time. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs, everything) using the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) developed by the TruLens team at Snowflake.
I don't think this is true as the deepeval/ragas integrations support multiple agentic metrics and we offer agentic judges (that dive into the trace) as well as built-in scorers that can read the trace.
Can we also point to our agentic judges (free advertising :)): https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/#trace-based-judges
Good catch. Removed the claim and reworded to focus on TruLens bringing the GPA framework specifically. Also added a link to the trace-based judges docs so it's clear MLflow already has this capability.
Added. Linked to the trace-based judges page in the reworded intro and in the Resources section.
> image: /img/blog/mlflow-trulens-evaluation-thumbnail.png
> ---
>
> MLflow's [third-party scorer framework](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/) already supports LLM-as-a-judge evaluations from [DeepEval, RAGAS, and Phoenix](/blog/third-party-scorers) across an ecosystem with 18M+ monthly PyPI downloads. Those scorers look at inputs and outputs. Did the response answer the question? Was it grounded in the context? That's enough for chatbots and RAG pipelines, but agents are a different problem. The [TruLens](https://www.trulens.org/) integration ([PR #19492](https://github.com/mlflow/mlflow/pull/19492)) changes that by evaluating what happens inside the execution trace.
Suggested change:
> MLflow's [third-party scorer framework](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/) already supports LLM-as-a-judge evaluations from [DeepEval, RAGAS, and Phoenix](/blog/third-party-scorers), an ecosystem with 18M+ monthly PyPI downloads. We're excited to announce the [TruLens](https://www.trulens.org/) integration as we continue our efforts to expand support for various third-party evaluation frameworks.
Done, applied your suggestion.
> GPA stands for Goal-Plan-Action, and it evaluates the three alignment dimensions in an agent's execution:
> **Goal-Plan alignment** asks: did the agent make a good strategy? `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks. `ToolSelection` checks whether the agent picked the right tools for each subtask. An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
Suggested change:
> **Goal-Plan alignment** asks: did the agent make a good strategy? An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
>
> - `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks.
> - `ToolSelection` checks whether the agent picked the right tools for each subtask.
Nit on making these into bullet points for easier reading. Same for the two below.
Done, converted all three GPA dimensions to bullet point format.
> The integration brings GPA evaluation to MLflow's scorer API, making it accessible through the same `mlflow.genai.evaluate()` interface used for all other scorers.
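Editor's note on the contract the quoted line describes: a scorer takes an evaluation row (here, a trace) and returns a score with a rationale. The sketch below is a purely illustrative, pure-Python stand-in for that contract; `Feedback`, `plan_adherence_stub`, and the trace schema are all made up and are not the real MLflow or TruLens API.

```python
import json
from dataclasses import dataclass

# Illustrative stand-ins only; NOT the real MLflow/TruLens classes.
@dataclass
class Feedback:
    score: float    # normalized to [0.0, 1.0]
    rationale: str  # explanation of how the score was reached

def plan_adherence_stub(trace: dict) -> Feedback:
    """Toy trace-aware check: did every planned step get an executed span?"""
    planned = set(trace.get("plan", []))
    executed = {span["name"] for span in trace.get("spans", [])}
    missing = planned - executed
    score = 1.0 if not missing else 1.0 - len(missing) / max(len(planned), 1)
    rationale = (
        "All planned steps were executed."
        if not missing
        else f"Planned steps never executed: {sorted(missing)}"
    )
    return Feedback(score=score, rationale=rationale)

trace = {
    "plan": ["search_flights", "book_flight", "book_hotel"],
    "spans": [{"name": "search_flights"}, {"name": "book_flight"}],
}
feedback = plan_adherence_stub(trace)
print(feedback.score)      # 1 of 3 planned steps missing -> score 2/3
print(feedback.rationale)
```

The real scorers replace the toy heuristic with an LLM judge, but the shape of the exchange (trace in, score plus rationale out) is the same one the blog describes.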
> The integration exposes six scorers, one for each GPA dimension:
Suggested change:
> The integration exposes six agent scorers covering the three GPA dimensions:
> `print(feedback.rationale)  # Chain-of-thought reasoning explaining the score`
> | Scorer | Alignment | What it checks |
is this a bit redundant with the content in ## The Agent GPA Framework - any way to reduce redundancy?
Agreed. Merged the two sections. The GPA Framework section now includes the table and code directly, removed the separate Agent Trace Scorers heading.
should we remove the table? I feel like it just repeats the bullets above - probably a better fit for our docs than the blog. lmk if you feel differently.
> The difference from the agent trace scorers is that RAG scorers work with inputs, outputs, and expectations (like every other third-party scorer), while agent trace scorers require a trace object. You can use both in the same evaluation call. Just pass traces as your data, and each scorer takes what it needs.
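Editor's note: the "each scorer takes what it needs" dispatch in the quoted paragraph can be sketched in plain Python. Everything below (`evaluate`, the two scorer functions, the row schema) is a hypothetical stand-in for illustration, not MLflow's actual API.

```python
# Each evaluation row carries inputs, outputs, and the trace; output-level
# scorers read inputs/outputs, trace-level scorers read the trace. Passing
# the row as keyword arguments lets every scorer pull only what it needs.

def answer_nonempty(inputs, outputs, **_):
    """Output-level check: the agent produced some answer."""
    return 1.0 if outputs else 0.0

def no_repeated_tool_calls(trace, **_):
    """Trace-level check: no tool was called twice with identical arguments."""
    calls = [(s["name"], tuple(sorted(s.get("args", {}).items())))
             for s in trace["spans"] if s.get("type") == "TOOL"]
    return 1.0 if len(calls) == len(set(calls)) else 0.0

def evaluate(rows, scorers):
    """Run every scorer over every row; each pulls the fields it needs."""
    return [{name: fn(**row) for name, fn in scorers.items()} for row in rows]

rows = [{
    "inputs": "Book me a trip to Paris",
    "outputs": "Flight and hotel booked.",
    "trace": {"spans": [
        {"type": "TOOL", "name": "search_flights", "args": {"dest": "CDG"}},
        {"type": "TOOL", "name": "search_flights", "args": {"dest": "CDG"}},
        {"type": "TOOL", "name": "book_flight", "args": {"id": "F123"}},
    ]},
}]

results = evaluate(rows, {"answered": answer_nonempty,
                          "efficient": no_repeated_tool_calls})
print(results)  # output-level check passes; trace-level check flags the duplicate search
```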
> ## Provider Routing
I don't think we need this section, though we can add a single line saying you can configure the model used. AFAIK we tend not to document the Databricks stuff on the open-source blog, as many people reading are not DBX customers.
Removed the whole section. Added a one-liner about model provider support in the combined example instead.
> ## Combining Trace and Output Evaluation
> With both trace-aware agent scorers and traditional input/output scorers in the same framework, MLflow can now evaluate agent behavior and output quality in a single API call. Where this gets interesting is mixing evaluation types in one call. Run agent trace scorers alongside RAG scorers and scorers from other frameworks:
I think the wording here is a bit deceptive as this has always been part of the functionality - I feel like we can keep this API call, but remove the ones above to reduce bloat (i.e., instead of one for agent, one for RAG, and one for both, just have a single one for both). WDYT?
Agreed. Removed the separate agent-only and RAG-only evaluate() calls. Kept a single combined example that shows both scorer types together.
> ## Getting Started
> ```bash
> pip install mlflow trulens trulens-providers-litellm
> ```

Suggested change:
> pip install mlflow>=3.10.0 trulens trulens-providers-litellm

Let's just be extra clear about the version needed.
> The `trulens-providers-litellm` package is needed for non-Databricks model providers (OpenAI, Anthropic, etc.). If you only use the Databricks managed judge, the base `trulens` package is enough.
same here about not mentioning DBX - I think it's ok to require both.
Done. Removed the Databricks-specific note from Getting Started.
> from mlflow.genai.scorers.trulens import get_scorer
>
> # Create RAG scorers by name
> scorer = get_scorer("Groundedness", model="openai:/gpt-4o")
Nit: I think let's avoid this syntax (get_scorer) as it makes it look like it doesn't have first-class support. Maybe if there's another judge we haven't namespaced that works when using get_scorer, we can include that, as it highlights the robustness of the integration.
Done. Removed the get_scorer example. Getting Started now shows direct class instantiation only.
- Rewrote intro per suggestion, removed the false "for the first time" claim
- Converted GPA descriptions to bullet points
- Merged the redundant Agent Trace Scorers section into GPA Framework
- Shrunk the architecture diagram with maxWidth
- Fixed span ordering and added the missing retry span
- Updated search_traces to use the locations parameter
- Switched model refs to gpt-5-mini
- Rewrote the RAG section to use trace-based context extraction instead of expectations
- Removed the Provider Routing section
- Consolidated to a single combined evaluate example
- Added the version pin and removed DBX mentions from Getting Started
- Removed the get_scorer example
- Linked to the trace-based judges docs

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
All 19 comments addressed in d12d8be. Summary of changes:
Ran the travel-planning agent demo locally with TruLens scorers (PlanAdherence, ExecutionEfficiency, LogicalConsistency) and captured fresh screenshots from the MLflow UI showing actual scores and rationale. Updated alt text to match the numeric score format shown in the UI. Signed-off-by: debu-sinha <debusinha2009@gmail.com>
> {/* truncate */}
>
> The integration adds 10 scorers that bring the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) to MLflow. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs) to evaluate agent behavior. MLflow already supports [trace-based judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/#trace-based-judges) and agentic metrics from DeepEval and RAGAS. TruLens adds a structured three-dimensional lens (Goal, Plan, Action) developed by the TruLens team at Snowflake.
Suggested change:
> The integration adds 10 scorers that bring the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) to MLflow. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs) to evaluate agent behavior. MLflow already supports [trace-based judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/#trace-based-judges) and agentic metrics from DeepEval and RAGAS, but with the TruLens integration, MLflow now supports the structured three-dimensional lens (Goal, Plan, Action) developed by the TruLens team at Snowflake.
> `print(feedback.rationale)  # Chain-of-thought reasoning explaining the score`
> | Scorer | Alignment | What it checks |
should we remove the table? I feel like it just repeats the bullets above - probably a better fit for our docs than the blog. lmk if you feel differently.
> Pass a trace and nothing else. Under the hood, the integration serializes your MLflow trace to JSON and passes the full span tree to TruLens' provider, which evaluates each dimension with chain-of-thought reasoning. You get back a score and a rationale explaining what it found. No manual span extraction needed.
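Editor's note: the serialization step this quoted paragraph describes can be pictured with a toy span tree. The schema below is invented for illustration; the real integration serializes MLflow's own trace and span objects.

```python
import json

# Invented span schema: a root agent span with nested tool-call children.
span_tree = {
    "name": "plan_trip",
    "type": "AGENT",
    "children": [
        {"name": "search_flights", "type": "TOOL",
         "inputs": {"dest": "CDG"}, "outputs": "3 flights found", "children": []},
        {"name": "book_flight", "type": "TOOL",
         "inputs": {"id": "F123"}, "outputs": "confirmed", "children": []},
    ],
}

def count_spans(span):
    """Walk the tree so we can sanity-check that nothing was dropped."""
    return 1 + sum(count_spans(c) for c in span["children"])

# The full tree, not just the final output, is what the judge gets to read.
payload = json.dumps(span_tree, indent=2)
print(count_spans(span_tree))  # 3 spans serialized
```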
> <img
Sorry for the back-and-forth: possible to make this a tad bigger? Feels a bit too small now 😅
> `Span 6: book_hotel(hotel_id="H456") -> confirmed`
> Output-only evaluation gives this a pass. The trip got booked. Trace-level evaluation catches three problems:
Suggested change:
> Output-only evaluation gives this a pass - the trip got booked. However, trace-level evaluation catches three problems:
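Editor's note: to make the output-only versus trace-level contrast concrete, here is a purely illustrative scan over a flat span list like the one quoted in the diff. The span contents are invented; it flags two kinds of problems (a redundant call and a failed call that needed a retry) that an output-only check would never see, since the final output still says "confirmed".

```python
# Invented spans mirroring the numbered-span example in the blog draft.
spans = [
    {"n": 1, "tool": "search_flights", "args": {"dest": "CDG"}, "ok": True},
    {"n": 2, "tool": "search_flights", "args": {"dest": "CDG"}, "ok": True},  # redundant
    {"n": 3, "tool": "book_flight", "args": {"id": "F123"}, "ok": False},     # failed
    {"n": 4, "tool": "book_flight", "args": {"id": "F123"}, "ok": True},      # retry
    {"n": 5, "tool": "search_hotels", "args": {"city": "Paris"}, "ok": True},
    {"n": 6, "tool": "book_hotel", "args": {"hotel_id": "H456"}, "ok": True},
]

def find_issues(spans):
    """Flag repeated identical calls and failed calls that forced a retry."""
    issues, seen = [], set()
    for s in spans:
        key = (s["tool"], tuple(sorted(s["args"].items())), s["ok"])
        if key in seen:
            issues.append(f"span {s['n']}: redundant call to {s['tool']}")
        seen.add(key)
        if not s["ok"]:
            issues.append(f"span {s['n']}: {s['tool']} failed and needed a retry")
    return issues

for issue in find_issues(spans):
    print(issue)
```

The real scorers use an LLM judge rather than hand-written rules, but this is the class of signal that only exists inside the trace.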
> Each scorer runs independently and writes results to the same experiment. Results land in the MLflow assessment table alongside any other evaluation results.
> <img
These two images look glued to each other - can we put some text in between explaining what's happening in each image?
> />
> ## Getting Started
can we add some buffer text in here? Like "To get started, install MLflow and TruLens with:"
and then maybe we can point to the docs page here as well.
Closes #460
Summary
Blog post covering the TruLens integration I contributed to MLflow's third-party scorer framework (PR #19492, merged). The blog focuses on agent trace evaluation using the Agent GPA framework, which evaluates what happens inside an agent's execution trace rather than just the final output.
Content
- `mlflow.genai.evaluate()` call

Related
Checklist