
Commit 6ec8e89

sfc-gh-jreini and Cortex Code authored
MLFlow Release Finishing Touches (#2359)
* add missed mkdocs link

* Fix MLflow integration issues with TruLens scorers

  1. Fix AttributeError when instrumenting LiteLLM endpoint

     The `_instrument_class` method was failing with:

         AttributeError: 'CallTypes' object has no attribute '__name__'

     This occurred because the method tried to wrap non-callable attributes (like enum values) that happened to have the same name as the target method. Added a `callable()` check to skip non-callable attributes.

  2. Suppress noisy third-party library warnings

     Added warning filters to suppress:

     - "pkg_resources is deprecated" warning from the munch library
     - python-dotenv "could not parse statement" warnings/logs

     These warnings were cluttering output when using TruLens scorers via the MLflow GenAI evaluation framework.

* Add MLflow + TruLens scorers example notebook

  New example notebook demonstrating how to use TruLens feedback functions as first-class scorers in the MLflow GenAI evaluation framework (MLflow 3.10+). Covers:

  - RAG evaluation scorers (Groundedness, ContextRelevance, AnswerRelevance)
  - Output scorers (Coherence)
  - Agent trace scorers (ToolSelection, ToolCalling)
  - Batch evaluation with `mlflow.genai.evaluate`
  - Threshold configuration and multi-provider support

* Update MLflow + TruLens scorers example notebook

  Removed the redundant Best Practices section to keep the notebook focused on demonstrating scorer usage.

* Add agent evaluation examples to MLflow integration docs

  Added a new "Agent Evaluation" section covering:

  - Batch agent evaluation with `predict_fn`
  - Evaluating individual agent traces
  - Note clarifying when to use agent vs RAG scorers

* Rename 'Running Feedback Functions' to 'Running Metrics'

Generated with [Cortex Code](https://docs.snowflake.com/user-guide/snowflake-cortex/cortex-agents)

Co-authored-by: sfc-gh-jreini <sfc-gh-jreini@users.noreply.github.com>
Co-authored-by: Cortex Code <noreply@snowflake.com>
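The `callable()` fix described in the commit message can be illustrated with a minimal sketch. Note this is not the actual TruLens `_instrument_class` implementation; the `Endpoint` class and `instrument_class` helper below are hypothetical stand-ins showing the guard pattern:

```python
import functools


class Endpoint:
    """Toy stand-in for an instrumented endpoint class (hypothetical)."""

    # A non-callable class attribute that happens to share a name with a
    # target method elsewhere; wrapping it would fail on `__name__` access.
    completion = "some-enum-value"

    def chat(self, prompt: str) -> str:
        return f"echo: {prompt}"


def instrument_class(cls, method_name: str) -> bool:
    """Wrap `method_name` on `cls`, skipping attributes that are not callable."""
    attr = getattr(cls, method_name, None)
    if not callable(attr):
        # The fix: enum values, strings, etc. cannot be wrapped, so skip
        # them instead of raising AttributeError.
        return False

    @functools.wraps(attr)
    def wrapper(*args, **kwargs):
        # ... record the call here, then delegate to the original
        return attr(*args, **kwargs)

    setattr(cls, method_name, wrapper)
    return True


print(instrument_class(Endpoint, "completion"))  # False: skipped safely
print(instrument_class(Endpoint, "chat"))        # True: wrapped
```

Without the `callable()` check, the first call would attempt to wrap the string attribute and crash, which is the failure mode the commit fixes.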
1 parent fd6268b commit 6ec8e89

File tree

7 files changed: +1016 −31 lines


docs/component_guides/evaluation/mlflow.md

Lines changed: 80 additions & 0 deletions
@@ -176,6 +176,86 @@

```python
scorer = Groundedness(model="openai:/gpt-4o")
feedback = scorer(trace=trace)
```

## Agent Evaluation

Agent GPA scorers evaluate tool selection and execution in agentic workflows. These scorers require traces since they inspect tool call spans.

### Batch Agent Evaluation

Use `predict_fn` with `mlflow.genai.evaluate` to trace and evaluate agent runs:

```python
import mlflow
import openai
from mlflow.genai.scorers.trulens import (
    Groundedness,
    ToolSelection,
    ToolCalling,
    Coherence,
)

mlflow.openai.autolog()


def run_agent(inputs: dict) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": inputs["user_query"]}],
        tools=[...],  # your tool definitions
    )
    # ... handle tool calls and return result


agent_queries = [
    "What's the weather in Paris?",
    "Book a flight to Tokyo for next Monday",
    "Send an email to my team about the meeting",
]

agent_eval_results = mlflow.genai.evaluate(
    data=[{"inputs": {"user_query": q}} for q in agent_queries],
    predict_fn=run_agent,
    scorers=[
        Groundedness(model="openai:/gpt-4o-mini"),
        ToolSelection(model="openai:/gpt-4o-mini"),
        ToolCalling(model="openai:/gpt-4o-mini"),
        Coherence(model="openai:/gpt-4o-mini"),
    ],
)

print(agent_eval_results.tables["eval_results"])
```

### Evaluating Individual Agent Traces

You can also evaluate agent traces individually:

```python
import mlflow
from mlflow.genai.scorers.trulens import ToolSelection, ToolCalling

mlflow.openai.autolog()

# Run your agent
result = run_agent({"user_query": "What's the weather in Paris?"})

# Get the trace
trace = mlflow.get_last_active_trace()

# Evaluate tool usage
tool_selection = ToolSelection(model="openai:/gpt-4o-mini")
tool_calling = ToolCalling(model="openai:/gpt-4o-mini")

selection_feedback = tool_selection(trace=trace)
calling_feedback = tool_calling(trace=trace)

print(f"Tool Selection: {selection_feedback.value}")
print(f"Tool Calling: {calling_feedback.value}")
print(f"Rationale: {selection_feedback.rationale}")
```

!!! note "Agent vs RAG Scorers"
    RAG and output scorers (`Groundedness`, `Coherence`, etc.) can be called directly with data or on traces. Agent GPA scorers (`ToolSelection`, `ToolCalling`, etc.) require a `trace` parameter since they evaluate tool usage patterns within trace spans.
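The `run_agent` stub above elides tool handling behind a comment. One way to complete that step is to dispatch each requested tool call and feed the results back as `tool` role messages. The sketch below operates on plain dicts mirroring the OpenAI chat completions wire format; the `get_weather` tool and `handle_tool_calls` helper are hypothetical, not part of the documented example:

```python
import json

# Hypothetical tool implementations keyed by name.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}


def handle_tool_calls(message: dict) -> list[dict]:
    """Turn an assistant message's tool_calls into `tool` role messages.

    `message` mirrors the shape of a chat completion message that
    requested tool calls; this is a sketch, not the SDK's typed objects.
    """
    results = []
    for call in message.get("tool_calls", []):
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": fn(**args),
            }
        )
    return results


# Example: one tool call requested by the model.
msg = {
    "tool_calls": [
        {
            "id": "call_1",
            "function": {
                "name": "get_weather",
                "arguments": '{"city": "Paris"}',
            },
        }
    ]
}
print(handle_tool_calls(msg))
```

In a real agent loop you would append these `tool` messages to the conversation and call the model again until it returns a final answer, which becomes `run_agent`'s return value.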
## Viewing Results

Results are automatically logged to MLflow:
