Add blog post: Agent Trace Evaluation with TruLens Scorers in MLflow #482
---
title: "Agent Trace Evaluation with TruLens Scorers in MLflow"
description: Score agent plans, tool calls, and reasoning with TruLens GPA framework through mlflow.genai.evaluate().
slug: mlflow-trulens-evaluation
authors: [debu-sinha]
tags: [genai, evaluation, trulens, agents, tracing]
thumbnail: /img/blog/mlflow-trulens-evaluation-thumbnail.png
image: /img/blog/mlflow-trulens-evaluation-thumbnail.png
---
MLflow's [third-party scorer framework](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/) already supports LLM-as-a-judge evaluations from [DeepEval, RAGAS, and Phoenix](/blog/third-party-scorers) across an ecosystem with 18M+ monthly PyPI downloads. Those scorers look at inputs and outputs. Did the response answer the question? Was it grounded in the context? That's enough for chatbots and RAG pipelines, but agents are a different problem. The [TruLens](https://www.trulens.org/) integration ([PR #19492](https://github.com/mlflow/mlflow/pull/19492)) changes that by evaluating what happens inside the execution trace.
An agent doesn't just produce an answer. It makes a plan, picks tools, executes a multi-step workflow, and adapts when steps fail. A correct final answer can mask a flawed plan, redundant tool calls, or broken reasoning along the way. To catch those problems, you need to evaluate what happened _inside_ the execution trace, not just what came out the other end.
{/* truncate */}
The integration adds 10 scorers that bring trace-aware agent evaluation to the scorer framework for the first time. You pass an MLflow trace, and the scorer reads the full span tree (plans, tool calls, intermediate outputs, everything) using the [Agent GPA framework](https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/) developed by the TruLens team at Snowflake.
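To make "reads the full span tree" concrete, here is a toy sketch of the kind of signal a trace-aware scorer can extract. The `Span` class, the walker, and the span names below are hypothetical illustrations for this post, not the MLflow trace schema or the TruLens scorer internals:

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical, minimal stand-in for a trace span; the real MLflow
# trace schema is richer (span types, attributes, parent/child links).
@dataclass
class Span:
    name: str
    inputs: dict
    children: list = field(default_factory=list)

def iter_spans(span):
    """Depth-first walk over a span tree."""
    yield span
    for child in span.children:
        yield from iter_spans(child)

def redundant_tool_calls(root):
    """Flag tool calls repeated with identical inputs - one signal an
    efficiency-style scorer could derive from the trace alone."""
    calls = Counter(
        (s.name, tuple(sorted(s.inputs.items()))) for s in iter_spans(root)
    )
    return [name for (name, _), n in calls.items() if n > 1]

root = Span("agent", {}, children=[
    Span("search_flights", {"dest": "NYC"}),
    Span("search_flights", {"dest": "NYC"}),  # duplicate call
    Span("book_flight", {"flight_id": "F123"}),
])
print(redundant_tool_calls(root))  # -> ['search_flights']
```

An output-only check never sees the duplicated `search_flights` call; a scorer that walks the span tree does.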
## The Agent GPA Framework
GPA stands for Goal-Plan-Action, and it evaluates the three alignment dimensions in an agent's execution:
**Goal-Plan alignment** asks: did the agent make a good strategy? `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks. `ToolSelection` checks whether the agent picked the right tools for each subtask. An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
**Suggested change:**

Before: **Goal-Plan alignment** asks: did the agent make a good strategy? `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks. `ToolSelection` checks whether the agent picked the right tools for each subtask. An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.

After: **Goal-Plan alignment** asks: did the agent make a good strategy? An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.
- `PlanQuality` checks whether the plan decomposes the goal into feasible subtasks.
- `ToolSelection` checks whether the agent picked the right tools for each subtask.
nit on making these into bullet points for easier reading. same for the two below
Done, converted all three GPA dimensions to bullet point format.
**Suggested change:**

Before: `The integration exposes six scorers, one for each GPA dimension:`

After: `The integration exposes six agent scorers covering the three GPA dimensions:`
Done.
is this a bit redundant with the content in ## The Agent GPA Framework - any way to reduce redundancy?
Agreed. Merged the two sections. The GPA Framework section now includes the table and code directly, removed the separate Agent Trace Scorers heading.
should we remove the table? I feel like it just repeats the bullets above - probably a better fit for our docs than the blog. lmk if you feel differently.
can we make this smaller so it fits on a laptop screen? I noticed I had to scroll thru the diagram
Scaled down. Added a maxWidth: 680px constraint so it fits on a laptop screen without scrolling.
sorry for the back-and-forth, possible to make this a tad bigger? feels a bit too small now 😅
**Suggested change:**

Before: `Output-only evaluation gives this a pass. The trip got booked. Trace-level evaluation catches three problems:`

After: `Output-only evaluation gives this a pass - the trip got booked. However, trace-level evaluation catches three problems:`
shouldn't this mean there's two spans?
`Span 6: book_hotel(hotel_id="H456")`
Fixed. Split into Span 5 (book_hotel with null, error) and Span 6 (book_hotel with H456, confirmed).
should we swap span 3 and 4 around? because the booking of flight is currently done after hotel search
Good point. Reordered so each search-book pair is sequential: search flights, book flight, then search hotels, book hotel. Makes the plan adherence issue clearer.
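The reordering being discussed can be sketched as a toy plan-adherence check: each search span should be immediately followed by its matching booking span. The span names come from the example in this thread; the checker itself is a hypothetical illustration, not a TruLens scorer:

```python
# Hypothetical search -> book pairing from the trip-booking example
# discussed in this thread; real plan adherence is judged by an LLM.
EXPECTED_PAIRS = {"search_flights": "book_flight", "search_hotels": "book_hotel"}

def follows_plan(span_names):
    """True if every search span is immediately followed by its
    matching book span."""
    for i, name in enumerate(span_names):
        if name in EXPECTED_PAIRS:
            if i + 1 >= len(span_names) or span_names[i + 1] != EXPECTED_PAIRS[name]:
                return False
    return True

# Reordered sequence: each search-book pair is sequential.
print(follows_plan(["search_flights", "book_flight",
                    "search_hotels", "book_hotel"]))  # -> True
# Original ordering: flight booked only after the hotel search.
print(follows_plan(["search_flights", "search_hotels",
                    "book_flight", "book_hotel"]))  # -> False
```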
**Suggested change:**

Before: `traces = mlflow.search_traces(experiment_ids=["1"])`

After: `traces = mlflow.search_traces(locations=["..."])`
Done.
**Suggested change:**

Before: `ExecutionEfficiency(model="openai:/gpt-4o"),`

After: `ExecutionEfficiency(model="openai:/gpt-5-mini"),`
nit: let's use the latest models & generally smaller models for judges (due to cost). Unless you believe these judges need larger models to work well
Done, updated all model references to gpt-5-mini.
These two images look glued to each other - can we put some text in between explaining what's happening in each image?
doesn't RAG also need to look at the trace? inputs/outputs don't contain the context
You're right. The TruLens RAG scorers extract context from retrieval spans in the trace via extract_retrieval_context_from_trace. Rewrote the RAG examples to show trace-based context extraction instead of manual expectations.
I thought we had changed this such that the scorer would look at retrieval spans for this information? Expectations are ground-truths and I don't think context is a ground-truth.
Agreed, context isn't a ground truth. Removed the expectations-based example. The updated examples pass a trace, and the scorer extracts context from retrieval spans automatically.
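The idea settled on here (pull retrieved context out of the trace rather than passing it as a ground-truth expectation) can be illustrated with a toy extractor. The span dicts below are hypothetical stand-ins; the real integration walks MLflow spans via the `extract_retrieval_context_from_trace` helper named in this thread:

```python
# Toy illustration only: real MLflow traces use typed span objects,
# not plain dicts, and the retrieval span type is part of the schema.
def retrieval_context(spans):
    """Collect document chunks from retrieval spans, in trace order."""
    chunks = []
    for span in spans:
        if span.get("span_type") == "RETRIEVER":
            chunks.extend(span.get("outputs", []))
    return chunks

trace_spans = [
    {"span_type": "CHAT_MODEL", "outputs": ["rewrite query"]},
    {"span_type": "RETRIEVER", "outputs": ["doc A", "doc B"]},
    {"span_type": "RETRIEVER", "outputs": ["doc C"]},
]
print(retrieval_context(trace_spans))  # -> ['doc A', 'doc B', 'doc C']
```

Because the context lives in the trace, the user never has to hand it over as an expectation.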
is there a rationale we can print as well?
Done. Added print(feedback.rationale) to both the agent and RAG examples.
I don't think we need this section, though we can add a single line saying you can configure the model used. AFAIK we tend not to document the Databricks stuff on the open-source blog, as many people reading are not DBX customers.
Removed the whole section. Added a one-liner about model provider support in the combined example instead.
I think the wording here is a bit deceptive as this has always been part of the functionality - I feel like we can keep this API call, but remove the ones above to reduce bloat (i.e., instead of one for agent, one for RAG, and one for both, just have a single one for both). WDYT?
Agreed. Removed the separate agent-only and RAG-only evaluate() calls. Kept a single combined example that shows both scorer types together.
can we add some buffer text in here? Like "To get started, install MLflow and TruLens with:"
and then maybe we can point to the docs page here as well.
**Suggested change:**

Before: `pip install mlflow trulens trulens-providers-litellm`

After: `pip install mlflow>=3.10.0 trulens trulens-providers-litellm`
let's just be extra clear about the version needed
Done.
same here about not mentioning DBX - I think it's ok to require both.
Done. Removed the Databricks-specific note from Getting Started.
nit: I think let's avoid this syntax (get_scorer) as it makes it look like it doesn't have first-class support - maybe if there's another judge we haven't namespaced that works when using get_scorer, we can include that as it highlights the robustness of the integration
Done. Removed the get_scorer example. Getting Started now shows direct class instantiation only.
Change to docs about this integration:
Updated to point to the MLflow integration docs: https://www.trulens.org/component_guides/evaluation/mlflow/
Done, applied your suggestion.