Add blog post: Agent Trace Evaluation with TruLens Scorers in MLflow #482
---
title: "Agent Trace Evaluation with TruLens Scorers in MLflow"
description: Score agent plans, tool calls, and reasoning with TruLens GPA framework through mlflow.genai.evaluate().
slug: mlflow-trulens-evaluation
authors: [debu-sinha]
tags: [genai, evaluation, trulens, agents, tracing]
thumbnail: /img/blog/mlflow-trulens-evaluation-thumbnail.png
image: /img/blog/mlflow-trulens-evaluation-thumbnail.png
---
MLflow's [third-party scorer framework](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/) already supports LLM-as-a-judge evaluations from [DeepEval, RAGAS, and Phoenix](/blog/third-party-scorers) across an ecosystem with 18M+ monthly PyPI downloads. Those scorers look at inputs and outputs. Did the response answer the question? Was it grounded in the context? That's enough for chatbots and RAG pipelines, but agents are a different problem. The [TruLens](https://www.trulens.org/) integration ([PR #19492](https://github.com/mlflow/mlflow/pull/19492)) changes that by evaluating what happens inside the execution trace.
An agent doesn't just produce an answer. It makes a plan, picks tools, executes a multi-step workflow, and adapts when steps fail. A correct final answer can mask a flawed plan, redundant tool calls, or broken reasoning along the way. To catch those problems, you need to evaluate what happened _inside_ the execution trace, not just what came out the other end.
{/* truncate */}
The integration adds 10 scorers that bring trace-aware agent evaluation to the scorer framework for the first time. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs, everything) using the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) developed by the TruLens team at Snowflake.
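To make "reads the full span tree" concrete, here is a toy sketch of the kind of signal a trace-aware scorer can extract. The `Span` class, the walker, and the span names below are hypothetical illustrations for this post, not the MLflow trace schema or the TruLens scorer internals:

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical, minimal stand-in for a trace span; the real MLflow
# trace schema is richer (span types, attributes, parent/child links).
@dataclass
class Span:
    name: str
    inputs: dict
    children: list = field(default_factory=list)

def iter_spans(span):
    """Depth-first walk over a span tree."""
    yield span
    for child in span.children:
        yield from iter_spans(child)

def redundant_tool_calls(root):
    """Flag tool calls repeated with identical inputs - one signal an
    efficiency-style scorer could derive from the trace alone."""
    calls = Counter(
        (s.name, tuple(sorted(s.inputs.items()))) for s in iter_spans(root)
    )
    return [name for (name, _), n in calls.items() if n > 1]

root = Span("agent", {}, children=[
    Span("search_flights", {"dest": "NYC"}),
    Span("search_flights", {"dest": "NYC"}),  # duplicate call
    Span("book_flight", {"flight_id": "F123"}),
])
print(redundant_tool_calls(root))  # -> ['search_flights']
```

An output-only check never sees the duplicated `search_flights` call; a scorer that walks the span tree does.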
## The Agent GPA Framework
GPA stands for Goal-Plan-Action, and it evaluates the three alignment dimensions in an agent's execution:
**Goal-Plan alignment** asks: did the agent make a good strategy? `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks. `ToolSelection` checks whether the agent picked the right tools for each subtask. An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
**Suggested change:**

Before: **Goal-Plan alignment** asks: did the agent make a good strategy? `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks. `ToolSelection` checks whether the agent picked the right tools for each subtask. An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.

After: **Goal-Plan alignment** asks: did the agent make a good strategy? An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
- `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks.
- `ToolSelection` checks whether the agent picked the right tools for each subtask.
nit on making these into bullet points for easier reading. same for the two below
Done, converted all three GPA dimensions to bullet point format.
**Suggested change:**

Before: `The integration exposes six scorers, one for each GPA dimension:`

After: `The integration exposes six agent scorers covering the three GPA dimensions:`
Done.
is this a bit redundant with the content in ## The Agent GPA Framework - any way to reduce redundancy?
Agreed. Merged the two sections. The GPA Framework section now includes the table and code directly, removed the separate Agent Trace Scorers heading.
should we remove the table? I feel like it just repeats the bullets above - probably a better fit for our docs than the blog. lmk if you feel differently.
can we make this smaller so it fits on a laptop screen? I noticed I had to scroll thru the diagram
Scaled down. Added a maxWidth: 680px constraint so it fits on a laptop screen without scrolling.
sorry for the back-and-forth, possible to make this a tad bigger? feels a bit too small now 😅
**Suggested change:**

Before: `Output-only evaluation gives this a pass. The trip got booked. Trace-level evaluation catches three problems:`

After: `Output-only evaluation gives this a pass - the trip got booked. However, trace-level evaluation catches three problems:`
shouldn't this mean there's two spans?
`Span 6: book_hotel(hotel_id="H456")`
Fixed. Split into Span 5 (book_hotel with null, error) and Span 6 (book_hotel with H456, confirmed).
should we swap span 3 and 4 around? because the booking of flight is currently done after hotel search
Good point. Reordered so each search-book pair is sequential: search flights, book flight, then search hotels, book hotel. Makes the plan adherence issue clearer.
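The reordering being discussed can be sketched as a toy plan-adherence check: each search span should be immediately followed by its matching booking span. The span names come from the example in this thread; the checker itself is a hypothetical illustration, not a TruLens scorer:

```python
# Hypothetical search -> book pairing from the trip-booking example
# discussed in this thread; real plan adherence is judged by an LLM.
EXPECTED_PAIRS = {"search_flights": "book_flight", "search_hotels": "book_hotel"}

def follows_plan(span_names):
    """True if every search span is immediately followed by its
    matching book span."""
    for i, name in enumerate(span_names):
        if name in EXPECTED_PAIRS:
            if i + 1 >= len(span_names) or span_names[i + 1] != EXPECTED_PAIRS[name]:
                return False
    return True

# Reordered sequence: each search-book pair is sequential.
print(follows_plan(["search_flights", "book_flight",
                    "search_hotels", "book_hotel"]))  # -> True
# Original ordering: flight booked only after the hotel search.
print(follows_plan(["search_flights", "search_hotels",
                    "book_flight", "book_hotel"]))  # -> False
```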
**Suggested change:**

Before: `traces = mlflow.search_traces(experiment_ids=["1"])`

After: `traces = mlflow.search_traces(locations=["..."])`
Done.
**Suggested change:**

Before: `ExecutionEfficiency(model="openai:/gpt-4o"),`

After: `ExecutionEfficiency(model="openai:/gpt-5-mini"),`
nit: let's use the latest models & generally smaller models for judges (due to cost). Unless you believe these judges need larger models to work well
Done, updated all model references to gpt-5-mini.
These two images look glued to each other - can we put some text in between explaining what's happening in each image?
doesn't RAG also need to look at the trace? inputs/outputs don't contain the context
You're right. The TruLens RAG scorers extract context from retrieval spans in the trace via extract_retrieval_context_from_trace. Rewrote the RAG examples to show trace-based context extraction instead of manual expectations.
I thought we had changed this such that the scorer would look at retrieval spans for this information? Expectations are ground-truths and I don't think context is a ground-truth.
Agreed, context isn't a ground truth. Removed the expectations-based example. The updated examples pass a trace, and the scorer extracts context from retrieval spans automatically.
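The idea settled on here (pull retrieved context out of the trace rather than passing it as a ground-truth expectation) can be illustrated with a toy extractor. The span dicts below are hypothetical stand-ins; the real integration walks MLflow spans via the `extract_retrieval_context_from_trace` helper named in this thread:

```python
# Toy illustration only: real MLflow traces use typed span objects,
# not plain dicts, and the retrieval span type is part of the schema.
def retrieval_context(spans):
    """Collect document chunks from retrieval spans, in trace order."""
    chunks = []
    for span in spans:
        if span.get("span_type") == "RETRIEVER":
            chunks.extend(span.get("outputs", []))
    return chunks

trace_spans = [
    {"span_type": "CHAT_MODEL", "outputs": ["rewrite query"]},
    {"span_type": "RETRIEVER", "outputs": ["doc A", "doc B"]},
    {"span_type": "RETRIEVER", "outputs": ["doc C"]},
]
print(retrieval_context(trace_spans))  # -> ['doc A', 'doc B', 'doc C']
```

Because the context lives in the trace, the user never has to hand it over as an expectation.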
is there a rationale we can print as well?
Done. Added print(feedback.rationale) to both the agent and RAG examples.
I don't think we need this section, though we can add a single line saying you can configure the model used. AFAIK we tend not to document the Databricks stuff on the open-source blog, as many people reading are not DBX customers.
Removed the whole section. Added a one-liner about model provider support in the combined example instead.
I think the wording here is a bit deceptive as this has always been part of the functionality - I feel like we can keep this API call, but remove the ones above to reduce bloat (i.e., instead of one for agent, one for RAG, and one for both, just have a single one for both). WDYT?
Agreed. Removed the separate agent-only and RAG-only evaluate() calls. Kept a single combined example that shows both scorer types together.
can we add some buffer text in here? Like "To get started, install MLflow and TruLens with:"
and then maybe we can point to the docs page here as well.
**Suggested change:**

Before: `pip install mlflow trulens trulens-providers-litellm`

After: `pip install mlflow>=3.10.0 trulens trulens-providers-litellm`
let's just be extra clear about the version needed
Done.
same here about not mentioning DBX - I think it's ok to require both.
Done. Removed the Databricks-specific note from Getting Started.
nit: I think let's avoid this syntax (get_scorer) as it makes it look like it doesn't have first-class support - maybe if there's another judge we haven't namespaced that works when using get_scorer, we can include that as it highlights the robustness of the integration
Done. Removed the get_scorer example. Getting Started now shows direct class instantiation only.
Change to docs about this integration:
Updated to point to the MLflow integration docs: https://www.trulens.org/component_guides/evaluation/mlflow/
Done, applied your suggestion.