Add blog post: Agent Trace Evaluation with TruLens Scorers in MLflow #482
debu-sinha wants to merge 7 commits into mlflow:main
Conversation
@smoorjani ready for review whenever you get a chance

@dmatrix would you be able to review this one too? It covers the TruLens scorer integration for trace-aware agent evaluation.

🚀 Netlify Preview Deployed! Preview URL: https://pr-482--test-mlflow-website.netlify.app (PR #482). This preview will be updated automatically on new commits.
Force-pushed e00d4c2 to c30da4e, then to 2db9778, then to 8cbcd29.
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Hey @sfc-gh-jreini, the MLflow team greenlit this blog post. Since it complements the Snowflake blog you published, would you be interested in co-authoring or reviewing it? Netlify preview is live: https://pr-482--test-mlflow-website.netlify.app
smoorjani left a comment:
Looks great! Left some comments to address. Let's also run this through a grammar checker (if you haven't already).
> {/* truncate */}
>
> The integration adds 10 scorers that bring trace-aware agent evaluation to the scorer framework for the first time. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs, everything) using the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) developed by the TruLens team at Snowflake.
I don't think this is true as the deepeval/ragas integrations support multiple agentic metrics and we offer agentic judges (that dive into the trace) as well as built-in scorers that can read the trace.
Can we also point to our agentic judges (free advertising :)): https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/#trace-based-judges
Good catch. Removed the claim and reworded to focus on TruLens bringing the GPA framework specifically. Also added a link to the trace-based judges docs so it's clear MLflow already has this capability.
Added. Linked to the trace-based judges page in the reworded intro and in the Resources section.
> image: /img/blog/mlflow-trulens-evaluation-thumbnail.png
> ---
>
> MLflow's [third-party scorer framework](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/) already supports LLM-as-a-judge evaluations from [DeepEval, RAGAS, and Phoenix](/blog/third-party-scorers) across an ecosystem with 18M+ monthly PyPI downloads. Those scorers look at inputs and outputs. Did the response answer the question? Was it grounded in the context? That's enough for chatbots and RAG pipelines, but agents are a different problem. The [TruLens](https://www.trulens.org/) integration ([PR #19492](https://github.com/mlflow/mlflow/pull/19492)) changes that by evaluating what happens inside the execution trace.
Suggested change:
> MLflow's [third-party scorer framework](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/) already supports LLM-as-a-judge evaluations from [DeepEval, RAGAS, and Phoenix](/blog/third-party-scorers), an ecosystem with 18M+ monthly PyPI downloads. We're excited to announce the [TruLens](https://www.trulens.org/) integration as we continue our efforts to expand support for various third-party evaluation frameworks.
Done, applied your suggestion.
> GPA stands for Goal-Plan-Action, and it evaluates the three alignment dimensions in an agent's execution:
> **Goal-Plan alignment** asks: did the agent make a good strategy? `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks. `ToolSelection` checks whether the agent picked the right tools for each subtask. An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
Suggested change:
> **Goal-Plan alignment** asks: did the agent make a good strategy? An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
>
> - `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks.
> - `ToolSelection` checks whether the agent picked the right tools for each subtask.
Nit on making these into bullet points for easier reading. Same for the two below.
Done, converted all three GPA dimensions to bullet point format.
> The integration brings GPA evaluation to MLflow's scorer API, making it accessible through the same `mlflow.genai.evaluate()` interface used for all other scorers.
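Editor's note on the contract the quoted line describes: a scorer takes an evaluation row (here, a trace) and returns a score with a rationale. The sketch below is a purely illustrative, pure-Python stand-in for that contract; `Feedback`, `plan_adherence_stub`, and the trace schema are all made up and are not the real MLflow or TruLens API.

```python
import json
from dataclasses import dataclass

# Illustrative stand-ins only; NOT the real MLflow/TruLens classes.
@dataclass
class Feedback:
    score: float    # normalized to [0.0, 1.0]
    rationale: str  # explanation of how the score was reached

def plan_adherence_stub(trace: dict) -> Feedback:
    """Toy trace-aware check: did every planned step get an executed span?"""
    planned = set(trace.get("plan", []))
    executed = {span["name"] for span in trace.get("spans", [])}
    missing = planned - executed
    score = 1.0 if not missing else 1.0 - len(missing) / max(len(planned), 1)
    rationale = (
        "All planned steps were executed."
        if not missing
        else f"Planned steps never executed: {sorted(missing)}"
    )
    return Feedback(score=score, rationale=rationale)

trace = {
    "plan": ["search_flights", "book_flight", "book_hotel"],
    "spans": [{"name": "search_flights"}, {"name": "book_flight"}],
}
feedback = plan_adherence_stub(trace)
print(feedback.score)      # 1 of 3 planned steps missing -> score 2/3
print(feedback.rationale)
```

The real scorers replace the toy heuristic with an LLM judge, but the shape of the exchange (trace in, score plus rationale out) is the same one the blog describes.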
> The integration exposes six scorers, one for each GPA dimension:
Suggested change:
> The integration exposes six agent scorers covering the three GPA dimensions:
> `print(feedback.rationale)  # Chain-of-thought reasoning explaining the score`
> | Scorer | Alignment | What it checks |
is this a bit redundant with the content in ## The Agent GPA Framework - any way to reduce redundancy?
Agreed. Merged the two sections. The GPA Framework section now includes the table and code directly, removed the separate Agent Trace Scorers heading.
should we remove the table? I feel like it just repeats the bullets above - probably a better fit for our docs than the blog. lmk if you feel differently.
> The difference from the agent trace scorers is that RAG scorers work with inputs, outputs, and expectations (like every other third-party scorer), while agent trace scorers require a trace object. You can use both in the same evaluation call. Just pass traces as your data, and each scorer takes what it needs.
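Editor's note: the "each scorer takes what it needs" dispatch in the quoted paragraph can be sketched in plain Python. Everything below (`evaluate`, the two scorer functions, the row schema) is a hypothetical stand-in for illustration, not MLflow's actual API.

```python
# Each evaluation row carries inputs, outputs, and the trace; output-level
# scorers read inputs/outputs, trace-level scorers read the trace. Passing
# the row as keyword arguments lets every scorer pull only what it needs.

def answer_nonempty(inputs, outputs, **_):
    """Output-level check: the agent produced some answer."""
    return 1.0 if outputs else 0.0

def no_repeated_tool_calls(trace, **_):
    """Trace-level check: no tool was called twice with identical arguments."""
    calls = [(s["name"], tuple(sorted(s.get("args", {}).items())))
             for s in trace["spans"] if s.get("type") == "TOOL"]
    return 1.0 if len(calls) == len(set(calls)) else 0.0

def evaluate(rows, scorers):
    """Run every scorer over every row; each pulls the fields it needs."""
    return [{name: fn(**row) for name, fn in scorers.items()} for row in rows]

rows = [{
    "inputs": "Book me a trip to Paris",
    "outputs": "Flight and hotel booked.",
    "trace": {"spans": [
        {"type": "TOOL", "name": "search_flights", "args": {"dest": "CDG"}},
        {"type": "TOOL", "name": "search_flights", "args": {"dest": "CDG"}},
        {"type": "TOOL", "name": "book_flight", "args": {"id": "F123"}},
    ]},
}]

results = evaluate(rows, {"answered": answer_nonempty,
                          "efficient": no_repeated_tool_calls})
print(results)  # output-level check passes; trace-level check flags the duplicate search
```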
> ## Provider Routing
I don't think we need this section, though we can add a single line saying you can configure the model used. AFAIK we tend not to document the Databricks stuff on the open-source blog, as many people reading are not DBX customers.
Removed the whole section. Added a one-liner about model provider support in the combined example instead.
> ## Combining Trace and Output Evaluation
> With both trace-aware agent scorers and traditional input/output scorers in the same framework, MLflow can now evaluate agent behavior and output quality in a single API call. Where this gets interesting is mixing evaluation types in one call. Run agent trace scorers alongside RAG scorers and scorers from other frameworks:
I think the wording here is a bit deceptive as this has always been part of the functionality - I feel like we can keep this API call, but remove the ones above to reduce bloat (i.e., instead of one for agent, one for RAG, and one for both, just have a single one for both). WDYT?
Agreed. Removed the separate agent-only and RAG-only evaluate() calls. Kept a single combined example that shows both scorer types together.
> ## Getting Started
> ```bash
> pip install mlflow trulens trulens-providers-litellm
> ```

Suggested change:
> pip install mlflow>=3.10.0 trulens trulens-providers-litellm

Let's just be extra clear about the version needed.
> The `trulens-providers-litellm` package is needed for non-Databricks model providers (OpenAI, Anthropic, etc.). If you only use the Databricks managed judge, the base `trulens` package is enough.
same here about not mentioning DBX - I think it's ok to require both.
Done. Removed the Databricks-specific note from Getting Started.
> from mlflow.genai.scorers.trulens import get_scorer
>
> # Create RAG scorers by name
> scorer = get_scorer("Groundedness", model="openai:/gpt-4o")
Nit: I think let's avoid this syntax (get_scorer) as it makes it look like it doesn't have first-class support. Maybe if there's another judge we haven't namespaced that works when using get_scorer, we can include that, as it highlights the robustness of the integration.
Done. Removed the get_scorer example. Getting Started now shows direct class instantiation only.
- Rewrote intro per suggestion, removed the false "for the first time" claim
- Converted GPA descriptions to bullet points
- Merged the redundant Agent Trace Scorers section into GPA Framework
- Shrunk the architecture diagram with maxWidth
- Fixed span ordering and added the missing retry span
- Updated search_traces to use the locations parameter
- Switched model refs to gpt-5-mini
- Rewrote the RAG section to use trace-based context extraction instead of expectations
- Removed the Provider Routing section
- Consolidated to a single combined evaluate example
- Added the version pin and removed DBX mentions from Getting Started
- Removed the get_scorer example
- Linked to the trace-based judges docs

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
All 19 comments addressed in d12d8be. Summary of changes:
Ran the travel-planning agent demo locally with TruLens scorers (PlanAdherence, ExecutionEfficiency, LogicalConsistency) and captured fresh screenshots from the MLflow UI showing actual scores and rationale. Updated alt text to match the numeric score format shown in the UI. Signed-off-by: debu-sinha <debusinha2009@gmail.com>
> {/* truncate */}
>
> The integration adds 10 scorers that bring the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) to MLflow. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs) to evaluate agent behavior. MLflow already supports [trace-based judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/#trace-based-judges) and agentic metrics from DeepEval and RAGAS. TruLens adds a structured three-dimensional lens (Goal, Plan, Action) developed by the TruLens team at Snowflake.
Suggested change:
> The integration adds 10 scorers that bring the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) to MLflow. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs) to evaluate agent behavior. MLflow already supports [trace-based judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/#trace-based-judges) and agentic metrics from DeepEval and RAGAS, but with the TruLens integration, MLflow now supports the structured three-dimensional lens (Goal, Plan, Action) developed by the TruLens team at Snowflake.
> `print(feedback.rationale)  # Chain-of-thought reasoning explaining the score`
> | Scorer | Alignment | What it checks |
should we remove the table? I feel like it just repeats the bullets above - probably a better fit for our docs than the blog. lmk if you feel differently.
> Pass a trace and nothing else. Under the hood, the integration serializes your MLflow trace to JSON and passes the full span tree to TruLens' provider, which evaluates each dimension with chain-of-thought reasoning. You get back a score and a rationale explaining what it found. No manual span extraction needed.
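Editor's note: the serialization step this quoted paragraph describes can be pictured with a toy span tree. The schema below is invented for illustration; the real integration serializes MLflow's own trace and span objects.

```python
import json

# Invented span schema: a root agent span with nested tool-call children.
span_tree = {
    "name": "plan_trip",
    "type": "AGENT",
    "children": [
        {"name": "search_flights", "type": "TOOL",
         "inputs": {"dest": "CDG"}, "outputs": "3 flights found", "children": []},
        {"name": "book_flight", "type": "TOOL",
         "inputs": {"id": "F123"}, "outputs": "confirmed", "children": []},
    ],
}

def count_spans(span):
    """Walk the tree so we can sanity-check that nothing was dropped."""
    return 1 + sum(count_spans(c) for c in span["children"])

# The full tree, not just the final output, is what the judge gets to read.
payload = json.dumps(span_tree, indent=2)
print(count_spans(span_tree))  # 3 spans serialized
```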
> <img
Sorry for the back-and-forth: possible to make this a tad bigger? Feels a bit too small now 😅
> `Span 6: book_hotel(hotel_id="H456") -> confirmed`
> Output-only evaluation gives this a pass. The trip got booked. Trace-level evaluation catches three problems:
Suggested change:
> Output-only evaluation gives this a pass - the trip got booked. However, trace-level evaluation catches three problems:
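Editor's note: to make the output-only versus trace-level contrast concrete, here is a purely illustrative scan over a flat span list like the one quoted in the diff. The span contents are invented; it flags two kinds of problems (a redundant call and a failed call that needed a retry) that an output-only check would never see, since the final output still says "confirmed".

```python
# Invented spans mirroring the numbered-span example in the blog draft.
spans = [
    {"n": 1, "tool": "search_flights", "args": {"dest": "CDG"}, "ok": True},
    {"n": 2, "tool": "search_flights", "args": {"dest": "CDG"}, "ok": True},  # redundant
    {"n": 3, "tool": "book_flight", "args": {"id": "F123"}, "ok": False},     # failed
    {"n": 4, "tool": "book_flight", "args": {"id": "F123"}, "ok": True},      # retry
    {"n": 5, "tool": "search_hotels", "args": {"city": "Paris"}, "ok": True},
    {"n": 6, "tool": "book_hotel", "args": {"hotel_id": "H456"}, "ok": True},
]

def find_issues(spans):
    """Flag repeated identical calls and failed calls that forced a retry."""
    issues, seen = [], set()
    for s in spans:
        key = (s["tool"], tuple(sorted(s["args"].items())), s["ok"])
        if key in seen:
            issues.append(f"span {s['n']}: redundant call to {s['tool']}")
        seen.add(key)
        if not s["ok"]:
            issues.append(f"span {s['n']}: {s['tool']} failed and needed a retry")
    return issues

for issue in find_issues(spans):
    print(issue)
```

The real scorers use an LLM judge rather than hand-written rules, but this is the class of signal that only exists inside the trace.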
> Each scorer runs independently and writes results to the same experiment. Results land in the MLflow assessment table alongside any other evaluation results.
> <img
These two images look glued to each other - can we put some text in between explaining what's happening in each image?
> />
> ## Getting Started
can we add some buffer text in here? Like "To get started, install MLflow and TruLens with:"
and then maybe we can point to the docs page here as well.
Closes #460
Summary
Blog post covering the TruLens integration I contributed to MLflow's third-party scorer framework (PR #19492, merged). The blog focuses on agent trace evaluation using the Agent GPA framework, which evaluates what happens inside an agent's execution trace rather than just the final output.
Content
- `mlflow.genai.evaluate()` call

Related
Checklist