
Add blog post: Agent Trace Evaluation with TruLens Scorers in MLflow#482

Open
debu-sinha wants to merge 7 commits into mlflow:main from debu-sinha:blog/mlflow-trulens-evaluation

Conversation


@debu-sinha debu-sinha commented Feb 22, 2026

Closes #460

Summary

Blog post covering the TruLens integration I contributed to MLflow's third-party scorer framework (PR #19492, merged). The blog focuses on agent trace evaluation using the Agent GPA framework, which evaluates what happens inside an agent's execution trace rather than just the final output.

Content

  • 6 agent trace scorers (PlanQuality, ToolSelection, PlanAdherence, ToolCalling, LogicalConsistency, ExecutionEfficiency) based on the Agent GPA framework
  • 4 RAG scorers (Groundedness, ContextRelevance, AnswerRelevance, Coherence)
  • Concrete travel-planning example showing how trace evaluation catches problems output-only evaluation misses
  • Provider routing across Databricks, OpenAI, and LiteLLM-supported providers
  • Combining trace and output evaluation in a single mlflow.genai.evaluate() call

Related

Checklist

  • Blog renders locally with dev server
  • All images render correctly
  • All URLs verified (9/9 pass)
  • Code snippets verified against source code
  • Reviewer name verified (Samraj Moorjani)
  • Anti-AI writing check passes (1 em-dash, 0 semicolons)
  • Thumbnail with MLflow + TruLens logos at correct ratio

@debu-sinha
Author

@smoorjani ready for review whenever you get a chance

@debu-sinha
Author

@dmatrix would you be able to review this one too? It covers the TruLens scorer integration for trace-aware agent evaluation.


github-actions bot commented Feb 24, 2026

🚀 Netlify Preview Deployed!

Preview URL: https://pr-482--test-mlflow-website.netlify.app

Details

PR: #482
Build Action: https://github.com/mlflow/mlflow-website/actions/runs/22560779201
Deploy Action: https://github.com/mlflow/mlflow-website/actions/runs/22563345199

This preview will be updated automatically on new commits.

debu-sinha force-pushed the blog/mlflow-trulens-evaluation branch from e00d4c2 to c30da4e on February 25, 2026 14:23
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
debu-sinha force-pushed the blog/mlflow-trulens-evaluation branch from c30da4e to 2db9778 on February 25, 2026 14:59
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
debu-sinha force-pushed the blog/mlflow-trulens-evaluation branch from 2db9778 to 8cbcd29 on February 25, 2026 15:11
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha
Author

Hey @sfc-gh-jreini, the MLflow team greenlit this blog post. Since it complements the Snowflake blog you published, would you be interested in co-authoring or reviewing it? Netlify preview is live: https://pr-482--test-mlflow-website.netlify.app

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

@smoorjani smoorjani left a comment


Looks great! Left some comments to address. Let's also run this through a grammar checker (if you haven't already).


{/* truncate */}

The integration adds 10 scorers that bring trace-aware agent evaluation to the scorer framework for the first time. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs, everything) using the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) developed by the TruLens team at Snowflake.
Collaborator

I don't think this is true as the deepeval/ragas integrations support multiple agentic metrics and we offer agentic judges (that dive into the trace) as well as built-in scorers that can read the trace.


Author

Good catch. Removed the claim and reworded to focus on TruLens bringing the GPA framework specifically. Also added a link to the trace-based judges docs so it's clear MLflow already has this capability.

Author

Added. Linked to the trace-based judges page in the reworded intro and in the Resources section.

image: /img/blog/mlflow-trulens-evaluation-thumbnail.png
---

MLflow's [third-party scorer framework](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/) already supports LLM-as-a-judge evaluations from [DeepEval, RAGAS, and Phoenix](/blog/third-party-scorers) across an ecosystem with 18M+ monthly PyPI downloads. Those scorers look at inputs and outputs. Did the response answer the question? Was it grounded in the context? That's enough for chatbots and RAG pipelines, but agents are a different problem. The [TruLens](https://www.trulens.org/) integration ([PR #19492](https://github.com/mlflow/mlflow/pull/19492)) changes that by evaluating what happens inside the execution trace.
Collaborator

Suggested change
MLflow's [third-party scorer framework](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/) already supports LLM-as-a-judge evaluations from [DeepEval, RAGAS, and Phoenix](/blog/third-party-scorers) across an ecosystem with 18M+ monthly PyPI downloads. Those scorers look at inputs and outputs. Did the response answer the question? Was it grounded in the context? That's enough for chatbots and RAG pipelines, but agents are a different problem. The [TruLens](https://www.trulens.org/) integration ([PR #19492](https://github.com/mlflow/mlflow/pull/19492)) changes that by evaluating what happens inside the execution trace.
MLflow's [third-party scorer framework](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/) already supports LLM-as-a-judge evaluations from [DeepEval, RAGAS, and Phoenix](/blog/third-party-scorers), an ecosystem with 18M+ monthly PyPI downloads. We're excited to announce the [TruLens](https://www.trulens.org/) integration as we continue our efforts to expand support for various third-party evaluation frameworks.

Author

Done, applied your suggestion.


GPA stands for Goal-Plan-Action, and it evaluates the three alignment dimensions in an agent's execution:

**Goal-Plan alignment** asks: did the agent make a good strategy? `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks. `ToolSelection` checks whether the agent picked the right tools for each subtask. An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
Collaborator

Suggested change
**Goal-Plan alignment** asks: did the agent make a good strategy? `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks. `ToolSelection` checks whether the agent picked the right tools for each subtask. An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
**Goal-Plan alignment** asks: did the agent make a good strategy? An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
- `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks.
- `ToolSelection` checks whether the agent picked the right tools for each subtask.

nit on making these into bullet points for easier reading. same for the two below

Author

Done, converted all three GPA dimensions to bullet point format.


The integration brings GPA evaluation to MLflow's scorer API, making it accessible through the same `mlflow.genai.evaluate()` interface used for all other scorers.

The integration exposes six scorers, one for each GPA dimension:
Collaborator

Suggested change
The integration exposes six scorers, one for each GPA dimension:
The integration exposes six agent scorers covering the three GPA dimensions:

Author

Done.

print(feedback.rationale) # Chain-of-thought reasoning explaining the score
```

| Scorer | Alignment | What it checks |
Collaborator

is this a bit redundant with the content in ## The Agent GPA Framework - any way to reduce redundancy?

Author

Agreed. Merged the two sections. The GPA Framework section now includes the table and code directly, removed the separate Agent Trace Scorers heading.

Collaborator

should we remove the table? I feel like it just repeats the bullets above - probably a better fit for our docs than the blog. lmk if you feel differently.


The difference from the agent trace scorers is that RAG scorers work with inputs, outputs, and expectations (like every other third-party scorer), while agent trace scorers require a trace object. You can use both in the same evaluation call. Just pass traces as your data, and each scorer takes what it needs.
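The "each scorer takes what it needs" idea can be pictured with a small sketch in plain Python. All field names and scorer functions below are hypothetical illustrations, not MLflow or TruLens internals: trace-level scorers read only the trace field of a record, while input/output scorers read only inputs and outputs.

```python
# Illustrative routing sketch (hypothetical, not the actual MLflow harness).
# Each scorer declares which record fields it needs; the harness passes only those.

def run_scorers(record, scorers):
    """Run every scorer against one evaluation record, passing each
    scorer only the fields listed in its `requires` declaration."""
    results = {}
    for scorer in scorers:
        kwargs = {field: record[field] for field in scorer["requires"]}
        results[scorer["name"]] = scorer["fn"](**kwargs)
    return results

# A trace-level scorer sees only the trace; an output scorer sees inputs/outputs.
trace_scorer = {
    "name": "execution_efficiency",
    "requires": ["trace"],
    # Toy heuristic standing in for an LLM judge: fewer spans = more efficient.
    "fn": lambda trace: 1.0 if len(trace["spans"]) <= 4 else 0.5,
}
output_scorer = {
    "name": "answer_relevance",
    "requires": ["inputs", "outputs"],
    # Toy relevance check: does the output mention the query's subject?
    "fn": lambda inputs, outputs: 1.0 if inputs["query"].split()[-1] in outputs else 0.0,
}

record = {
    "inputs": {"query": "book a trip to Paris"},
    "outputs": "Booked flights and a hotel in Paris.",
    "trace": {"spans": ["plan", "search_flights", "book_flight", "book_hotel"]},
}

scores = run_scorers(record, [trace_scorer, output_scorer])
```

Both scorers run against the same record in one pass, which is the shape of the combined `mlflow.genai.evaluate()` call described in the post.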

## Provider Routing
Collaborator

I don't think we need this section, though we can add a single line saying you can configure the model used. AFAIK we tend not to document the Databricks stuff on the open-source blog since many readers are not DBX customers.

Author

Removed the whole section. Added a one-liner about model provider support in the combined example instead.


## Combining Trace and Output Evaluation

With both trace-aware agent scorers and traditional input/output scorers in the same framework, MLflow can now evaluate agent behavior and output quality in a single API call. Where this gets interesting is mixing evaluation types in one call. Run agent trace scorers alongside RAG scorers and scorers from other frameworks:
Collaborator

I think the wording here is a bit deceptive as this has always been part of the functionality - I feel like we can keep this API call, but remove the ones above to reduce bloat (i.e., instead of one for agent, one for RAG, and one for both, just have a single one for both). WDYT?

Author

Agreed. Removed the separate agent-only and RAG-only evaluate() calls. Kept a single combined example that shows both scorer types together.

## Getting Started

```bash
pip install mlflow trulens trulens-providers-litellm
Collaborator

Suggested change
pip install mlflow trulens trulens-providers-litellm
pip install "mlflow>=3.10.0" trulens trulens-providers-litellm

let's just be extra clear about the version needed

Author

Done.

pip install mlflow trulens trulens-providers-litellm
```

The `trulens-providers-litellm` package is needed for non-Databricks model providers (OpenAI, Anthropic, etc.). If you only use the Databricks managed judge, the base `trulens` package is enough.
Collaborator

same here about not mentioning DBX - I think it's ok to require both.

Author

Done. Removed the Databricks-specific note from Getting Started.

from mlflow.genai.scorers.trulens import get_scorer

# Create RAG scorers by name
scorer = get_scorer("Groundedness", model="openai:/gpt-4o")
Collaborator

nit: I think let's avoid this syntax (get_scorer) as it makes it look like it doesn't have first-class support - maybe if there's another judge we haven't namespaced that works when using get_scorer, we can include that as it highlights the robustness of the integration

Author

Done. Removed the get_scorer example. Getting Started now shows direct class instantiation only.

Rewrote intro per suggestion, removed false 'for the first time' claim,
converted GPA descriptions to bullet points, merged redundant Agent Trace
Scorers section into GPA Framework, shrunk architecture diagram with
maxWidth, fixed span ordering and added missing retry span, updated
search_traces to use locations parameter, switched model refs to
gpt-5-mini, rewrote RAG section to use trace-based context extraction
instead of expectations, removed Provider Routing section, consolidated
to single combined evaluate example, added version pin, removed DBX
mentions from Getting Started, removed get_scorer example, linked to
trace-based judges docs.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha commented Feb 28, 2026

All 19 comments addressed in d12d8be. Summary of changes:

  • Rewrote intro per your suggestion, removed false 'for the first time' claim
  • Added link to trace-based judges docs
  • Converted GPA descriptions to bullet points
  • Merged redundant Agent Trace Scorers section into GPA Framework
  • Shrunk architecture diagram with maxWidth: 680px
  • Fixed span ordering (search-book pairs sequential) and added missing retry span
  • Updated search_traces to use locations parameter
  • Switched all model refs to gpt-5-mini
  • Rewrote RAG section to use trace-based context extraction instead of expectations
  • Removed Provider Routing section, added one-liner about model support
  • Consolidated to single combined evaluate() example
  • Added mlflow>=3.10.0 version pin
  • Removed Databricks mentions from Getting Started
  • Removed get_scorer example
  • Ran grammar check

Ran the travel-planning agent demo locally with TruLens scorers
(PlanAdherence, ExecutionEfficiency, LogicalConsistency) and captured
fresh screenshots from the MLflow UI showing actual scores and rationale.
Updated alt text to match the numeric score format shown in the UI.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Signed-off-by: debu-sinha <debusinha2009@gmail.com>

@smoorjani smoorjani left a comment


Left a few more comments, but pretty much looks good to merge to me after fixing those. Let's wait for @dmatrix or @B-Step62 for a final approval before merging.


{/* truncate */}

The integration adds 10 scorers that bring the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) to MLflow. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs) to evaluate agent behavior. MLflow already supports [trace-based judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/#trace-based-judges) and agentic metrics from DeepEval and RAGAS. TruLens adds a structured three-dimensional lens (Goal, Plan, Action) developed by the TruLens team at Snowflake.
Collaborator

Suggested change
The integration adds 10 scorers that bring the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) to MLflow. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs) to evaluate agent behavior. MLflow already supports [trace-based judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/#trace-based-judges) and agentic metrics from DeepEval and RAGAS. TruLens adds a structured three-dimensional lens (Goal, Plan, Action) developed by the TruLens team at Snowflake.
The integration adds 10 scorers that bring the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) to MLflow. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs) to evaluate agent behavior. MLflow already supports [trace-based judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/#trace-based-judges) and agentic metrics from DeepEval and RAGAS, but with the TruLens integration, MLflow now supports the structured three-dimensional lens (Goal, Plan, Action) developed by the TruLens team at Snowflake.


Pass a trace and nothing else. Under the hood, the integration serializes your MLflow trace to JSON and passes the full span tree to TruLens' provider, which evaluates each dimension with chain-of-thought reasoning. You get back a score and a rationale explaining what it found. No manual span extraction needed.
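The serialize-then-judge flow described above can be sketched roughly as follows. The span structure and prompt wording here are hypothetical stand-ins; the real integration works with MLflow trace objects and TruLens providers rather than plain dicts:

```python
import json

# Hypothetical minimal span tree; real MLflow traces carry far more metadata
# (span types, timings, attributes, parent/child links).
trace = {
    "spans": [
        {"name": "plan", "type": "CHAIN", "output": "1. search flights 2. book"},
        {"name": "search_flights", "type": "TOOL", "output": "[FL123, FL456]"},
        {"name": "book_flight", "type": "TOOL", "output": "confirmed"},
    ]
}

def trace_to_judge_prompt(trace):
    """Serialize the full span tree to JSON and embed it in a judge prompt,
    mirroring the serialize-and-evaluate flow described above."""
    serialized = json.dumps(trace, indent=2)
    return (
        "Evaluate the agent execution below for plan adherence.\n"
        "Return a score in [0, 1] and a rationale.\n\n"
        f"Trace:\n{serialized}"
    )

prompt = trace_to_judge_prompt(trace)
```

The point is that the whole span tree, not just the final output, reaches the judge, which is what lets the scorer reason about plans and intermediate tool calls.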

<img
Collaborator

sorry for the back-and-forth, possible to make this a tad bigger? feels a bit too small now 😅

Span 6: book_hotel(hotel_id="H456") -> confirmed
```
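One of the failure modes a trace surfaces here, redundant tool calls, can even be caught mechanically. A rough sketch over hypothetical flattened span tuples (the actual TruLens scorers use an LLM judge with chain-of-thought, not this heuristic):

```python
from collections import Counter

# Hypothetical flattened spans: (tool_name, arguments) per call.
spans = [
    ("search_flights", ("SFO", "CDG")),
    ("search_flights", ("SFO", "CDG")),  # duplicate search: a wasted call
    ("search_hotels", ("Paris",)),
    ("book_flight", ("FL123",)),
    ("book_hotel", ("H456",)),
]

def redundant_calls(spans):
    """Return tool calls that were issued more than once with identical arguments."""
    counts = Counter(spans)
    return {call: n for call, n in counts.items() if n > 1}

dupes = redundant_calls(spans)
```

An output-only check never sees these spans at all, which is exactly the gap trace-level scorers like `ExecutionEfficiency` are meant to close.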

Output-only evaluation gives this a pass. The trip got booked. Trace-level evaluation catches three problems:
Collaborator

Suggested change
Output-only evaluation gives this a pass. The trip got booked. Trace-level evaluation catches three problems:
Output-only evaluation gives this a pass - the trip got booked. However, trace-level evaluation catches three problems:


Each scorer runs independently and writes results to the same experiment. Results land in the MLflow assessment table alongside any other evaluation results.

<img
Collaborator

These two images look glued to each other - can we put some text in between explaining what's happening in each image?

/>

## Getting Started

Collaborator

can we add some buffer text in here? Like "To get started, install MLflow and TruLens with:"

and then maybe we can point to the docs page here as well.


Successfully merging this pull request may close these issues.

Blog Post Submission: Deep Agent Evaluation in MLflow with TruLens Scorers
