
Commit d691e81

Add SK ChatCompletionAgent notebook (#271)
* sk notebook
* changes
* update docs
* fix linter for unused var
* clean output
1 parent 6622714 commit d691e81

2 files changed: +358 -1 lines changed
Lines changed: 357 additions & 0 deletions
@@ -0,0 +1,357 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "bf5280e2",
"metadata": {},
"source": [
"# Evaluate Semantic Kernel AI (ChatCompletion) Agents in Azure AI Foundry"
]
},
{
"cell_type": "markdown",
"id": "0330c099",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This sample demonstrates how to evaluate Semantic Kernel AI ChatCompletionAgents in Azure AI Foundry. It provides a step-by-step guide to set up the environment, create an agent, and evaluate its performance."
]
},
{
"cell_type": "markdown",
"id": "b364c694",
"metadata": {},
"source": [
"## Time\n",
"You can expect to complete this sample in approximately 20 minutes."
]
},
{
"cell_type": "markdown",
"id": "919c6017",
"metadata": {},
"source": [
"## Prerequisites\n",
"### Packages\n",
"- `semantic-kernel` installed (`pip install semantic-kernel`)\n",
"- `azure-ai-evaluation` SDK installed\n",
"- An Azure OpenAI resource with a deployment configured\n",
"\n",
"Before running the sample:\n",
"```bash\n",
"pip install semantic-kernel azure-ai-projects azure-identity azure-ai-evaluation\n",
"```\n",
"\n",
"### Environment Variables\n",
"- For **AzureChatService** (Semantic Kernel Agent):\n",
"  - **`api_key`** – Azure OpenAI API key used by the agent.\n",
"  - **`chat_deployment_name`** – Name of the deployed chat model (e.g., `gpt-35-turbo`) used by the agent.\n",
"  - **`endpoint`** – Azure OpenAI endpoint URL (e.g., `https://<your-resource>.openai.azure.com/`).\n",
"- For **LLM Evaluation**:\n",
"  - **`AZURE_OPENAI_ENDPOINT`** – Azure OpenAI endpoint to be used by the evaluation LLM.\n",
"  - **`AZURE_OPENAI_API_KEY`** – Azure OpenAI API key for evaluation.\n",
"  - **`AZURE_OPENAI_API_VERSION`** – API version (e.g., `2024-05-01-preview`) for the evaluation LLM.\n",
"  - **`MODEL_DEPLOYMENT_NAME`** – Deployment name of the model used for evaluation, *as found under the \"Name\" column in the \"Models + endpoints\" tab in your Azure AI Foundry project*.\n",
"- For Azure AI Foundry (Bonus):\n",
"  - **`AZURE_SUBSCRIPTION_ID`** – Your Azure subscription ID where the AI Foundry project is hosted.\n",
"  - **`PROJECT_NAME`** – Name of the Azure AI Foundry project.\n",
"  - **`RESOURCE_GROUP_NAME`** – Resource group containing your AI Foundry project."
]
},
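{
"cell_type": "code",
"execution_count": null,
"id": "e52a91c0",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (a minimal sketch, not part of the original sample):\n",
"# confirm the evaluation-related environment variables listed above are set\n",
"# before running the rest of the notebook.\n",
"import os\n",
"\n",
"required_vars = [\n",
"    \"AZURE_OPENAI_ENDPOINT\",\n",
"    \"AZURE_OPENAI_API_KEY\",\n",
"    \"AZURE_OPENAI_API_VERSION\",\n",
"    \"MODEL_DEPLOYMENT_NAME\",\n",
"]\n",
"missing = [name for name in required_vars if not os.environ.get(name)]\n",
"if missing:\n",
"    print(f\"Missing environment variables: {missing}\")\n",
"else:\n",
"    print(\"All evaluation environment variables are set.\")"
]
},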
{
"cell_type": "markdown",
"id": "ba1d6576",
"metadata": {},
"source": [
"### Create an AzureChatCompletion service - [reference](https://learn.microsoft.com/en-us/semantic-kernel/concepts/ai-services/chat-completion/?tabs=csharp-AzureOpenAI%2Cpython-AzureOpenAI%2Cjava-AzureOpenAI&pivots=programming-language-python)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7dc6ce40",
"metadata": {},
"outputs": [],
"source": [
"from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion\n",
"\n",
"# You can do the following if you have set the necessary environment variables or created a .env file\n",
"chat_completion_service = AzureChatCompletion(service_id=\"my-service-id\")"
]
},
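{
"cell_type": "code",
"execution_count": null,
"id": "a7d30f12",
"metadata": {},
"outputs": [],
"source": [
"# Alternative (a sketch, not part of the original sample): configure the service\n",
"# explicitly instead of relying on environment variables / a .env file.\n",
"# The placeholder values below are assumptions you must replace with your own\n",
"# deployment name, endpoint, and key.\n",
"# chat_completion_service = AzureChatCompletion(\n",
"#     service_id=\"my-service-id\",\n",
"#     deployment_name=\"<your-chat-deployment-name>\",\n",
"#     endpoint=\"https://<your-resource>.openai.azure.com/\",\n",
"#     api_key=\"<your-api-key>\",\n",
"# )"
]
},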
{
"cell_type": "markdown",
"id": "ef319288",
"metadata": {},
"source": [
"### Create a ChatCompletionAgent - [reference](https://learn.microsoft.com/en-us/semantic-kernel/frameworks/agent/agent-types/chat-completion-agent?pivots=programming-language-python)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76781359",
"metadata": {},
"outputs": [],
"source": [
"from semantic_kernel.functions import kernel_function\n",
"from typing import Annotated\n",
"\n",
"\n",
"# This is a sample plugin that provides tools\n",
"class MenuPlugin:\n",
"    \"\"\"A sample Menu Plugin used for the concept sample.\"\"\"\n",
"\n",
"    @kernel_function(description=\"Provides a list of specials from the menu.\")\n",
"    def get_specials(self) -> Annotated[str, \"Returns the specials from the menu.\"]:\n",
"        return \"\"\"\n",
"        Special Soup: Clam Chowder\n",
"        Special Salad: Cobb Salad\n",
"        Special Drink: Chai Tea\n",
"        \"\"\"\n",
"\n",
"    @kernel_function(description=\"Provides the price of the requested menu item.\")\n",
"    def get_item_price(\n",
"        self, menu_item: Annotated[str, \"The name of the menu item.\"]\n",
"    ) -> Annotated[str, \"Returns the price of the menu item.\"]:\n",
"        _ = menu_item  # This is just to simulate a function that uses the input.\n",
"        return \"$9.99\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6abead3",
"metadata": {},
"outputs": [],
"source": [
"from semantic_kernel.agents import ChatCompletionAgent\n",
"\n",
"# Create the agent by directly providing the chat completion service\n",
"agent = ChatCompletionAgent(\n",
"    service=chat_completion_service,\n",
"    name=\"Chef\",\n",
"    instructions=\"Answer questions about the menu.\",\n",
"    plugins=[MenuPlugin()],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b7b9ba3",
"metadata": {},
"outputs": [],
"source": [
"thread = None\n",
"\n",
"user_inputs = [\n",
"    \"Hello\",\n",
"    \"What is the special drink today?\",\n",
"    \"What does that cost?\",\n",
"    \"Thank you\",\n",
"]\n",
"\n",
"for user_input in user_inputs:\n",
"    response = await agent.get_response(messages=user_input, thread=thread)\n",
"    print(f\"## User: {user_input}\")\n",
"    print(f\"## {response.name}: {response}\\n\")\n",
"    thread = response.thread"
]
},
{
"cell_type": "markdown",
"id": "2586d3e5",
"metadata": {},
"source": [
"### Converter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fcd6ac41",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.evaluation import SKAgentConverter\n",
"\n",
"# Get the available turn indices for the thread,\n",
"# useful for selecting a specific turn for evaluation\n",
"turn_indices = await SKAgentConverter._get_thread_turn_indices(thread=thread)\n",
"print(f\"Available turn indices: {turn_indices}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1d4ae12",
"metadata": {},
"outputs": [],
"source": [
"converter = SKAgentConverter()\n",
"\n",
"# Get the data for a single agent run\n",
"evaluation_data_single_run = await converter.convert(\n",
"    thread=thread,\n",
"    turn_index=2,  # Specify the turn index you want to evaluate\n",
"    agent=agent,  # Pass it to include the instructions and plugins in the evaluation data\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7813b5eb",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"file_name = \"evaluation_data.jsonl\"\n",
"# Save the agent thread data to a JSONL file (all turns)\n",
"evaluation_data = await converter.prepare_evaluation_data(threads=[thread], filename=file_name, agent=agent)\n",
"# print(json.dumps(evaluation_data, indent=4))\n",
"len(evaluation_data)  # number of turns in the thread"
]
},
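{
"cell_type": "code",
"execution_count": null,
"id": "5b8e2c47",
"metadata": {},
"outputs": [],
"source": [
"# Optional (a small sketch, not part of the original sample): peek at the first\n",
"# record of the saved JSONL file to see the fields the converter produces\n",
"# (e.g., query, response, tool_definitions).\n",
"with open(file_name, encoding=\"utf-8\") as f:\n",
"    first_record = json.loads(f.readline())\n",
"print(list(first_record.keys()))"
]
},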
{
"cell_type": "markdown",
"id": "8bf87cab",
"metadata": {},
"source": [
"### Setting up evaluators\n",
"\n",
"We will select the following evaluators to assess the different aspects relevant to agent quality:\n",
"\n",
"- [Intent resolution](https://aka.ms/intentresolution-sample): measures the extent to which an agent identifies the correct intent from a user query. Scale: integer 1-5. Higher is better.\n",
"- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample): evaluates the agent’s ability to select the appropriate tools and process the correct parameters from previous steps. Scale: float 0-1. Higher is better.\n",
"- [Task adherence](https://aka.ms/taskadherence-sample): measures the extent to which an agent’s final response adheres to the task based on its system message and a user query. Scale: integer 1-5. Higher is better.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6ee09df",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from pprint import pprint\n",
"\n",
"from azure.ai.evaluation import (\n",
"    ToolCallAccuracyEvaluator,\n",
"    AzureOpenAIModelConfiguration,\n",
"    IntentResolutionEvaluator,\n",
"    TaskAdherenceEvaluator,\n",
")\n",
"\n",
"model_config = AzureOpenAIModelConfiguration(\n",
"    azure_endpoint=os.environ[\"AZURE_OPENAI_ENDPOINT\"],\n",
"    api_key=os.environ[\"AZURE_OPENAI_API_KEY\"],\n",
"    api_version=os.environ[\"AZURE_OPENAI_API_VERSION\"],\n",
"    azure_deployment=os.environ[\"MODEL_DEPLOYMENT_NAME\"],\n",
")\n",
"\n",
"intent_resolution = IntentResolutionEvaluator(model_config=model_config)\n",
"\n",
"tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)\n",
"\n",
"task_adherence = TaskAdherenceEvaluator(model_config=model_config)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80bd50ff",
"metadata": {},
"outputs": [],
"source": [
"# Test a single evaluation run\n",
"evaluator = ToolCallAccuracyEvaluator(model_config=model_config)\n",
"\n",
"# evaluation_data_single_run.keys()  # query, response, tool_definitions\n",
"res = evaluator(**evaluation_data_single_run)\n",
"print(json.dumps(res, indent=4))"
]
},
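{
"cell_type": "code",
"execution_count": null,
"id": "9c4f7ab1",
"metadata": {},
"outputs": [],
"source": [
"# Optional (a sketch, not part of the original sample): run all three evaluators\n",
"# on the same single-turn data. This assumes each evaluator accepts the keyword\n",
"# arguments produced by the converter (query, response, tool_definitions).\n",
"single_run_results = {\n",
"    \"intent_resolution\": intent_resolution(**evaluation_data_single_run),\n",
"    \"tool_call_accuracy\": tool_call_accuracy(**evaluation_data_single_run),\n",
"    \"task_adherence\": task_adherence(**evaluation_data_single_run),\n",
"}\n",
"print(json.dumps(single_run_results, indent=4))"
]
},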
{
"cell_type": "markdown",
"id": "06bab561",
"metadata": {},
"source": [
"#### Bonus - run on the previously saved file (all turns)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0530c0d",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.evaluation import evaluate\n",
"\n",
"response = evaluate(\n",
"    data=file_name,\n",
"    evaluators={\n",
"        \"tool_call_accuracy\": tool_call_accuracy,\n",
"        \"intent_resolution\": intent_resolution,\n",
"        \"task_adherence\": task_adherence,\n",
"    },\n",
"    azure_ai_project={\n",
"        \"subscription_id\": os.environ[\"AZURE_SUBSCRIPTION_ID\"],\n",
"        \"project_name\": os.environ[\"PROJECT_NAME\"],\n",
"        \"resource_group_name\": os.environ[\"RESOURCE_GROUP_NAME\"],\n",
"    },\n",
")\n",
"\n",
"pprint(f'AI Foundry URL: {response.get(\"studio_url\")}')"
]
},
{
"cell_type": "markdown",
"id": "ac38d924",
"metadata": {},
"source": [
"## Inspect results on Azure AI Foundry\n",
"\n",
"Open the AI Foundry URL above to explore rich visualizations of the evaluation scores and reasoning, so you can quickly identify and fix issues with your agent."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "225ae69a",
"metadata": {},
"outputs": [],
"source": [
"# Alternatively, you can use the following to get the evaluation results in memory\n",
"\n",
"# average scores across all runs\n",
"pprint(response[\"metrics\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

scenarios/evaluate/Supported_Evaluation_Metrics/Agent_Evaluation/README.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ A general AI agent workflow typically contains a linear workflow of intent resol
 - [Task adherence](https://aka.ms/taskadherence-sample): measures the extent to which an agent’s final response adheres to the task based on its system message and a user query.
 - [Response Completeness](https://aka.ms/rescompleteness-sample): measures the extent to which an agent or RAG response is complete (does not miss critical information) compared to the ground truth.
 - [End-to-end Azure AI agent evaluation](https://aka.ms/e2e-agent-eval-sample): create an agent using Azure AI Agent Service and seamlessly evaluate its thread and run data, via converter support.
-
+- [End-to-end SK Chat Completion Agent evaluation](Evaluate_SK_Chat_Completion_Agent.ipynb): create an SK Chat Completion Agent and evaluate its thread and run data, via converter support.
 ### Objective

 This tutorial provides a step-by-step guide on how to evaluate AI agents using quality evaluators. By the end of this tutorial, you should be able to:

0 commit comments
