scorer = Groundedness(model="openai:/gpt-4o")
feedback = scorer(trace=trace)
```

## Agent Evaluation

Agent GPA scorers evaluate tool selection and execution in agentic workflows. These scorers require traces because they inspect tool call spans.

### Batch Agent Evaluation

Use `predict_fn` with `mlflow.genai.evaluate` to trace and evaluate agent runs:

```python
import openai

import mlflow
from mlflow.genai.scorers.trulens import (
    Groundedness,
    ToolSelection,
    ToolCalling,
    Coherence,
)

mlflow.openai.autolog()


def run_agent(inputs: dict) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": inputs["user_query"]}],
        tools=[...],  # your tool definitions
    )
    # ... handle tool calls and return result


agent_queries = [
    "What's the weather in Paris?",
    "Book a flight to Tokyo for next Monday",
    "Send an email to my team about the meeting",
]

agent_eval_results = mlflow.genai.evaluate(
    data=[{"inputs": {"user_query": q}} for q in agent_queries],
    predict_fn=run_agent,
    scorers=[
        Groundedness(model="openai:/gpt-4o-mini"),
        ToolSelection(model="openai:/gpt-4o-mini"),
        ToolCalling(model="openai:/gpt-4o-mini"),
        Coherence(model="openai:/gpt-4o-mini"),
    ],
)

print(agent_eval_results.tables["eval_results"])
```
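
The `eval_results` table returned above can be post-processed as a pandas DataFrame, for example to flag queries whose tool-usage scores fall below a threshold. A minimal sketch using a hypothetical stand-in frame (the column names and score values here are illustrative, not the scorers' actual output schema):

```python
import pandas as pd

# Hypothetical stand-in for agent_eval_results.tables["eval_results"];
# real column names depend on your scorer configuration.
eval_df = pd.DataFrame(
    {
        "user_query": [
            "What's the weather in Paris?",
            "Book a flight to Tokyo for next Monday",
            "Send an email to my team about the meeting",
        ],
        "tool_selection_score": [1.0, 0.5, 0.9],
        "tool_calling_score": [1.0, 0.4, 1.0],
    }
)

# Flag rows where either tool metric falls below a review threshold.
needs_review = eval_df[
    (eval_df["tool_selection_score"] < 0.7)
    | (eval_df["tool_calling_score"] < 0.7)
]
print(needs_review["user_query"].tolist())
```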

### Evaluating Individual Agent Traces

You can also evaluate agent traces individually:

```python
import mlflow
from mlflow.genai.scorers.trulens import ToolSelection, ToolCalling

mlflow.openai.autolog()

# Run your agent
result = run_agent({"user_query": "What's the weather in Paris?"})

# Get the trace
trace = mlflow.get_last_active_trace()

# Evaluate tool usage
tool_selection = ToolSelection(model="openai:/gpt-4o-mini")
tool_calling = ToolCalling(model="openai:/gpt-4o-mini")

selection_feedback = tool_selection(trace=trace)
calling_feedback = tool_calling(trace=trace)

print(f"Tool Selection: {selection_feedback.value}")
print(f"Tool Calling: {calling_feedback.value}")
print(f"Rationale: {selection_feedback.rationale}")
```
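
If you score several traces this way, a quick aggregate can summarize them across queries. A minimal sketch in plain Python, with illustrative numbers standing in for the real `feedback.value` results collected per trace:

```python
from statistics import mean

# Hypothetical per-query feedback values gathered by running the
# scorers above over several traces; real values come from
# feedback.value on each trace.
per_query_scores = {
    "What's the weather in Paris?": {"selection": 1.0, "calling": 1.0},
    "Book a flight to Tokyo for next Monday": {"selection": 0.6, "calling": 0.5},
    "Send an email to my team about the meeting": {"selection": 0.8, "calling": 1.0},
}

avg_selection = mean(s["selection"] for s in per_query_scores.values())
avg_calling = mean(s["calling"] for s in per_query_scores.values())

print(f"Avg Tool Selection: {avg_selection:.2f}")
print(f"Avg Tool Calling: {avg_calling:.2f}")
```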

!!! note "Agent vs RAG Scorers"
    RAG and output scorers (`Groundedness`, `Coherence`, etc.) can be called directly with data or on traces. Agent GPA scorers (`ToolSelection`, `ToolCalling`, etc.) require a `trace` parameter since they evaluate tool usage patterns within trace spans.

## Viewing Results

Results are automatically logged to MLflow: