
Commit 2ae4a98

Merge branch 'main' of https://github.com/MicrosoftDocs/azure-ai-docs-pr into azure-ai-foundry-agent-service-how-to
2 parents (391f387 + 4ca6aea), commit 2ae4a98

26 files changed: +369 / -1757 lines changed

.openpublishing.publish.config.json

Lines changed: 6 additions & 0 deletions
@@ -200,6 +200,12 @@
       "branch": "main",
       "branch_mapping": {}
     },
+    {
+      "path_to_root": "azure-search-java-samples",
+      "url": "https://github.com/Azure-Samples/azure-search-java-samples",
+      "branch": "main",
+      "branch_mapping": {}
+    },
     {
       "path_to_root": "azure-search-dotnet-samples",
       "url": "https://github.com/Azure-Samples/azure-search-dotnet-samples",

articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md

Lines changed: 91 additions & 8 deletions
@@ -52,7 +52,7 @@ model_config = AzureOpenAIModelConfiguration(
 )
 ```
 
-### Evaluator model support
+### Evaluator models support
 
 We support AzureOpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the LLM-judge depending on the evaluators:
 
@@ -65,7 +65,7 @@ For complex evaluation that requires refined reasoning, we recommend a strong re
 
 ## Intent resolution
 
-`IntentResolutionEvaluator` measures how well the system identifies and understands a user's request, including how well it scopes the users intent, asks clarifying questions, and reminds end users of its scope of capabilities. Higher score means better identification of user intent.
+`IntentResolutionEvaluator` measures how well the system identifies and understands a user's request, including how well it scopes the user's intent, asks clarifying questions, and reminds end users of its scope of capabilities. Higher score means better identification of user intent.
 
 ### Intent resolution example
 
@@ -99,11 +99,9 @@ The numerical score is on a Likert scale (integer 1 to 5) and a higher score is
 }
 }
 
-
-
 ```
 
-If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see our sample notebook for [Intent Resolution](https://aka.ms/intentresolution-sample).
+If you're building agents outside of Azure AI Foundry Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see our sample notebook for [Intent Resolution](https://aka.ms/intentresolution-sample).
 
 ## Tool call accuracy
 
@@ -112,15 +110,100 @@ If you're building agents outside of Azure AI Agent Service, this evaluator acce
 - the correctness of parameters used in tool calls;
 - the counts of missing or excessive calls.
 
-> [!NOTE]
-> `ToolCallAccuracyEvaluator` only supports Azure AI Agent's Function Tool evaluation, but doesn't support Built-in Tool evaluation. The agent run must have at least one Function Tool call and no Built-in Tool calls made to be evaluated.
+#### Tool call evaluation support
+
+`ToolCallAccuracyEvaluator` supports evaluation in Azure AI Foundry Agent Service for the following tools:
+
+- File Search
+- Azure AI Search
+- Bing Grounding
+- Bing Custom Search
+- SharePoint Grounding
+- Code Interpreter
+- Fabric Data Agent
+- OpenAPI
+- Function Tool (user-defined tools)
+
+However, if a non-supported tool is used in the agent run, it outputs a "pass" and a reason that evaluating the invoked tool(s) isn't supported, for ease of filtering out these cases. It's recommended that you wrap non-supported tools as user-defined tools to enable evaluation.
 
 ### Tool call accuracy example
 
 ```python
 from azure.ai.evaluation import ToolCallAccuracyEvaluator
 
 tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config, threshold=3)
+
+# provide the agent response with tool calls
+tool_call_accuracy(
+    query="What timezone corresponds to 41.8781,-87.6298?",
+    response=[
+        {
+            "createdAt": "2025-04-25T23:55:52Z",
+            "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
+            "role": "assistant",
+            "content": [
+                {
+                    "type": "tool_call",
+                    "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
+                    "name": "azure_maps_timezone",
+                    "arguments": {
+                        "lat": 41.878100000000003,
+                        "lon": -87.629800000000003
+                    }
+                }
+            ]
+        },
+        {
+            "createdAt": "2025-04-25T23:55:54Z",
+            "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
+            "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
+            "role": "tool",
+            "content": [
+                {
+                    "type": "tool_result",
+                    "tool_result": {
+                        "ianaId": "America/Chicago",
+                        "utcOffset": None,
+                        "abbreviation": None,
+                        "isDaylightSavingTime": None
+                    }
+                }
+            ]
+        },
+        {
+            "createdAt": "2025-04-25T23:55:55Z",
+            "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
+            "role": "assistant",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "The timezone for the coordinates 41.8781, -87.6298 is America/Chicago."
+                }
+            ]
+        }
+    ],
+    tool_definitions=[
+        {
+            "name": "azure_maps_timezone",
+            "description": "local time zone information for a given latitude and longitude.",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "lat": {
+                        "type": "float",
+                        "description": "The latitude of the location."
+                    },
+                    "lon": {
+                        "type": "float",
+                        "description": "The longitude of the location."
+                    }
+                }
+            }
+        }
+    ]
+)
+
+# alternatively, provide the tool calls directly without the full agent response
 tool_call_accuracy(
     query="How is the weather in Seattle?",
     tool_calls=[{
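The hunk above recommends wrapping non-supported tools as user-defined tools so that `ToolCallAccuracyEvaluator` can still score the run. The following is a minimal, hypothetical sketch of that pattern; the `fetch_stock_price` function, its tool definition, and the sample call are illustrative and assume the `model_config` defined earlier in the article.

```python
from azure.ai.evaluation import ToolCallAccuracyEvaluator

# Hypothetical wrapper: expose an otherwise unsupported tool as a user-defined Function Tool.
def fetch_stock_price(ticker: str) -> str:
    """Return the latest price for a ticker symbol; internally this could call any non-supported tool."""
    return "123.45"  # placeholder result

# Tool definition for the wrapper, in the same shape as the azure_maps_timezone definition above.
stock_tool_definition = {
    "name": "fetch_stock_price",
    "description": "Return the latest price for a ticker symbol.",
    "parameters": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string", "description": "The stock ticker symbol."}
        },
    },
}

# model_config is assumed to be the judge configuration defined earlier in this article.
tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config, threshold=3)
result = tool_call_accuracy(
    query="What is the current price of MSFT?",
    tool_calls=[{
        "type": "tool_call",
        "tool_call_id": "call_example_001",
        "name": "fetch_stock_price",
        "arguments": {"ticker": "MSFT"},
    }],
    tool_definitions=[stock_tool_definition],
)
print(result)
```

Because the wrapper is declared like any other Function Tool, the evaluator sees a supported tool call and returns a real score instead of the default "pass" described above.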
@@ -188,7 +271,7 @@ If you're building agents outside of Azure AI Agent Service, this evaluator acce
 
 ## Task adherence
 
-In various task-oriented AI systems such as agentic systems, it's important to assess whether the agent has stayed on track to complete a given task instead of making inefficient or out-of-scope steps. `TaskAdherenceEvaluator` measures how well an agents response adheres to their assigned tasks, according to their task instruction (extracted from system message and user query), and available tools. Higher score means better adherence of the system instruction to resolve the given task.
+In various task-oriented AI systems such as agentic systems, it's important to assess whether the agent has stayed on track to complete a given task instead of making inefficient or out-of-scope steps. `TaskAdherenceEvaluator` measures how well an agent's response adheres to their assigned tasks, according to their task instruction (extracted from system message and user query), and available tools. Higher score means better adherence of the system instruction to resolve the given task.
 
 ### Task adherence example
 
articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md

Lines changed: 3 additions & 2 deletions
@@ -243,8 +243,9 @@ AI systems can fabricate content or generate irrelevant responses outside the gi
 ### Groundedness Pro example
 
 ```python
-import os
 from azure.ai.evaluation import GroundednessProEvaluator
+from azure.identity import DefaultAzureCredential
+import os
 from dotenv import load_dotenv
 load_dotenv()
 
@@ -257,7 +258,7 @@
 ## Using Azure AI Foundry Development Platform, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
 azure_ai_project = os.environ.get("AZURE_AI_PROJECT")
 
-groundedness_pro = GroundednessProEvaluator(azure_ai_project=azure_ai_project),
+groundedness_pro = GroundednessProEvaluator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
 groundedness_pro(
     query="Is Marie Curie is born in Paris?",
     context="Background: 1. Marie Curie is born on November 7, 1867. 2. Marie Curie is born in Warsaw.",

articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md

Lines changed: 27 additions & 14 deletions
@@ -39,15 +39,27 @@ pip install azure-ai-evaluation
 
 ## Evaluate Azure AI agents
 
-If you use [Foundry Agent Service](../../../ai-services/agents/overview.md), you can seamlessly evaluate your agents via our converter support for Azure AI agent threads and runs. We support this list of evaluators for Azure AI agent messages from our converter:
+If you use [Foundry Agent Service](../../../ai-services/agents/overview.md), you can seamlessly evaluate your agents using our converter support for Azure AI agents and Semantic Kernel agents. The following evaluators are supported for evaluation data returned by the converter: `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence`, `Relevance`, and `Groundedness`.
 
-### Evaluators supported for evaluation data converter
+> [!NOTE]
+> If you are building other agents that output a different schema, you can convert them into the general openai-style [agent message schema](#agent-message-schema) and use the above evaluators.
+> More generally, if you can parse the agent messages into the [required data formats](./evaluate-sdk.md#data-requirements-for-built-in-evaluators), you can also use all of our evaluators.
 
-- Quality: `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence`, `Relevance`, `Coherence`, `Fluency`
-- Safety: `CodeVulnerabilities`, `Violence`, `Self-harm`, `Sexual`, `HateUnfairness`, `IndirectAttack`, `ProtectedMaterials`.
 
-> [!NOTE]
-> `ToolCallAccuracyEvaluator` only supports Foundry Agent's Function Tool evaluation (user-defined Python functions), but doesn't support other Tool evaluation. If an agent run invoked a tool other than Function Tool, it outputs a "pass" and a reason that evaluating the invoked tool(s) isn't supported.
+#### Tool call evaluation support
+`ToolCallAccuracyEvaluator` supports evaluation in Azure AI Agent for the following tools:
+
+- File Search
+- Azure AI Search
+- Bing Grounding
+- Bing Custom Search
+- SharePoint Grounding
+- Code Interpreter
+- Fabric Data Agent
+- OpenAPI
+- Function Tool (user-defined tools)
+
+However, if a non-supported tool is used in the agent run, it outputs a "pass" and a reason that evaluating the invoked tool(s) isn't supported, for ease of filtering out these cases. It is recommended that you wrap non-supported tools as user-defined tools to enable evaluation.
 
 Here's an example that shows you how to seamlessly build and evaluate an Azure AI agent. Separately from evaluation, Azure AI Foundry Agent Service requires `pip install azure-ai-projects azure-identity`, an Azure AI project connection string, and the supported models.
 
@@ -168,12 +180,12 @@ run_id = run.id
 converted_data = converter.convert(thread_id, run_id)
 ```
 
-And that's it! `converted_data` contains all inputs required for [these evaluators](#evaluators-supported-for-evaluation-data-converter). You don't need to read the input requirements for each evaluator and do any work to parse the inputs. All you need to do is select your evaluator and call the evaluator on this single run. We support AzureOpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the judge depending on the evaluators:
+And that's it! `converted_data` contains all inputs required for [these evaluators](#evaluate-azure-ai-agents). You don't need to read the input requirements for each evaluator and do any work to parse the inputs. All you need to do is select your evaluator and call the evaluator on this single run. We support AzureOpenAI or OpenAI [reasoning models](../../../ai-services/openai/how-to/reasoning.md) and non-reasoning models for the judge depending on the evaluators:
 
 | Evaluators | Reasoning Models as Judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (example: gpt-4.1, gpt-4o, etc.) | To enable |
 |--|--|--|--|
-| `Intent Resolution`, `Task Adherence`, `Tool Call Accuracy`, `Response Completeness`| Supported | Supported | Set additional parameter `is_reasoning_model=True` in initializing evaluators |
-| Other quality evaluators| Not Supported | Supported | -- |
+| All quality evaluators except for `GroundednessProEvaluator` | Supported | Supported | Set additional parameter `is_reasoning_model=True` in initializing evaluators |
+| `GroundednessProEvaluator` | User doesn't need to supply a judge model | User doesn't need to supply a judge model | -- |
 
 For complex tasks that require refined reasoning for the evaluation, we recommend a strong reasoning model like `o3-mini` or the o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.
 
@@ -197,17 +209,18 @@ model_config = {
     "api_version": os.getenv("AZURE_API_VERSION"),
 }
 
+# example config for a reasoning model
 reasoning_model_config = {
     "azure_deployment": "o3-mini",
     "api_key": os.getenv("AZURE_API_KEY"),
     "azure_endpoint": os.getenv("AZURE_ENDPOINT"),
     "api_version": os.getenv("AZURE_API_VERSION"),
 }
 
-# Evaluators with reasoning model support
+# Evaluators you might want to use with reasoning models
 quality_evaluators = {evaluator.__name__: evaluator(model_config=reasoning_model_config, is_reasoning_model=True) for evaluator in [IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator]}
 
-# Other evaluators do not support reasoning models
+# Other evaluators you might NOT want to use with reasoning models
 quality_evaluators.update({ evaluator.__name__: evaluator(model_config=model_config) for evaluator in [CoherenceEvaluator, FluencyEvaluator, RelevanceEvaluator]})
 
 ## Using Azure AI Foundry (non-Hub) project endpoint, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
@@ -223,7 +236,6 @@ for name, evaluator in quality_and_safety_evaluators.items():
     print(name)
     print(json.dumps(result, indent=4))
 
-
 ```
 
 #### Output format
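The loop in the hunk above prints one `result` per evaluator. As a minimal sketch of how a single result might be produced from the converted run, assuming the `converted_data` returned by `converter.convert(thread_id, run_id)` and the `model_config` shown earlier in this diff, the call can look like this:

```python
import json
from azure.ai.evaluation import IntentResolutionEvaluator

# Assumes model_config (an Azure OpenAI judge deployment) is defined as shown earlier in this diff.
intent_resolution = IntentResolutionEvaluator(model_config=model_config)

# converted_data already carries the query, response, and tool definitions for the run,
# so it can be passed straight through as keyword arguments.
result = intent_resolution(**converted_data)
print(json.dumps(result, indent=4))
```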
@@ -233,12 +245,12 @@ AI-assisted quality evaluators provide a result for a query and response pair. T
 - `{metric_name}`: Provides a numerical score, on a Likert scale (integer 1 to 5) or a float between 0 and 1.
 - `{metric_name}_label`: Provides a binary label (if the metric naturally outputs a binary score).
 - `{metric_name}_reason`: Explains why a certain score or label was given for each data point.
+- `details`: Optional output containing debugging information about the quality of a single agent run.
 
 To further improve intelligibility, all evaluators accept a binary threshold (unless their outputs are already binary) and output two new keys. For the binarization threshold, a default is set, which the user can override. The two new keys are:
 
 - `{metric_name}_result`: A "pass" or "fail" string based on a binarization threshold.
 - `{metric_name}_threshold`: A numerical binarization threshold set by default or by the user.
-- `additional_details`: Contains debugging information about the quality of a single agent run.
 
 See the following example output for some evaluators:
 
@@ -316,13 +328,14 @@ If you're using agents outside Azure AI Foundry Agent Service, you can still eva
 
 Agents typically emit messages to interact with a user or other agents. Our built-in evaluators can accept simple data types such as strings in `query`, `response`, and `ground_truth` according to the [single-turn data input requirements](./evaluate-sdk.md#data-requirements-for-built-in-evaluators). However, it can be a challenge to extract these simple data types from agent messages, due to the complex interaction patterns of agents and framework differences. For example, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.
 
-As illustrated in the following example, we enable agent message support specifically for the built-in evaluators `IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, and `TaskAdherenceEvaluator` to evaluate these aspects of agentic workflow. These evaluators take `tool_calls` or `tool_definitions` as parameters unique to agents.
+As illustrated in the following example, we enable agent message support for the following built-in evaluators to evaluate these aspects of agentic workflow. These evaluators may take `tool_calls` or `tool_definitions` as parameters unique to agents when evaluating agents.
 
 | Evaluator | `query` | `response` | `tool_calls` | `tool_definitions` |
 |----------------|---------------|---------------|---------------|---------------|
 | `IntentResolutionEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | Doesn't apply | Optional: `list[ToolCall]` |
 | `ToolCallAccuracyEvaluator` | Required: `Union[str, list[Message]]` | Optional: `Union[str, list[Message]]` | Optional: `Union[dict, list[ToolCall]]` | Required: `list[ToolDefinition]` |
 | `TaskAdherenceEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | Doesn't apply | Optional: `list[ToolCall]` |
+| `GroundednessEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | Doesn't apply | Required: `list[ToolCall]` |
 
 - `Message`: `dict` OpenAI-style message that describes agent interactions with a user, where the `query` must include a system message as the first message.
 - `ToolCall`: `dict` that specifies tool calls invoked during agent interactions with a user.
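To make the schema table above concrete, here is a hedged sketch of what `query`, `response`, and `tool_definitions` might look like for an agent built outside Foundry Agent Service; the `fetch_weather` tool and all field values are invented for illustration and follow the openai-style message format used elsewhere in this diff.

```python
# query: list[Message]; the system message must be the first message.
query = [
    {"role": "system", "content": "You are a helpful weather assistant."},
    {"role": "user", "content": "How is the weather in Seattle?"},
]

# response: list[Message] emitted by the agent, including tool_call content items.
response = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_example_123",
                "name": "fetch_weather",
                "arguments": {"location": "Seattle"},
            }
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "It is 60 degrees and rainy in Seattle."}],
    },
]

# tool_definitions: list[ToolDefinition] describing the tools available to the agent.
tool_definitions = [
    {
        "name": "fetch_weather",
        "description": "Fetches the weather for the given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The location to fetch weather for."}
            },
        },
    }
]
```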

articles/ai-foundry/how-to/develop/evaluate-sdk.md

Lines changed: 12 additions & 12 deletions
@@ -57,14 +57,14 @@ Built-in evaluators can accept query and response pairs, a list of conversations
 | `IntentResolutionEvaluator` | | | | ||
 | `ToolCallAccuracyEvaluator` | | | | ||
 | `TaskAdherenceEvaluator` | | | | ||
-| `GroundednessEvaluator` || | | | |
+| `GroundednessEvaluator` || | | | |
 | `GroundednessProEvaluator` || | | | |
 | `RetrievalEvaluator` || | | | |
 | `DocumentRetrievalEvaluator` || | || |
 | `RelevanceEvaluator` || | | ||
-| `CoherenceEvaluator` || | | | |
-| `FluencyEvaluator` || | | | |
-| `ResponseCompletenessEvaluator` | | ||| |
+| `CoherenceEvaluator` || | | | |
+| `FluencyEvaluator` || | | | |
+| `ResponseCompletenessEvaluator` | | ||| |
 | `QAEvaluator` | | ||| |
 | **Natural Language Processing (NLP) Evaluators** |
 | `SimilarityEvaluator` | | ||| |
@@ -74,15 +74,15 @@ Built-in evaluators can accept query and response pairs, a list of conversations
 | `BleuScoreEvaluator` | | ||| |
 | `MeteorScoreEvaluator` | | ||| |
 | **Safety Evaluators** |
-| `ViolenceEvaluator` | || | | |
-| `SexualEvaluator` | || | | |
-| `SelfHarmEvaluator` | || | | |
-| `HateUnfairnessEvaluator` | || | | |
-| `ProtectedMaterialEvaluator` | || | | |
-| `ContentSafetyEvaluator` | || | | |
+| `ViolenceEvaluator` | || | | |
+| `SexualEvaluator` | || | | |
+| `SelfHarmEvaluator` | || | | |
+| `HateUnfairnessEvaluator` | || | | |
+| `ProtectedMaterialEvaluator` | || | | |
+| `ContentSafetyEvaluator` | || | | |
 | `UngroundedAttributesEvaluator` | | || | |
-| `CodeVulnerabilityEvaluator` | | || | |
-| `IndirectAttackEvaluator` || | | | |
+| `CodeVulnerabilityEvaluator` | | || | |
+| `IndirectAttackEvaluator` || | | | |
 | **Azure OpenAI Graders** |
 | `AzureOpenAILabelGrader` || | | | |
 | `AzureOpenAIStringCheckGrader` || | | | |
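As a hedged illustration of the single-turn query-and-response input pattern that the table above describes, a minimal call might look like the following; the environment variable names mirror the model configuration used elsewhere in this diff and are placeholders.

```python
import os
from azure.ai.evaluation import RelevanceEvaluator

# Placeholder Azure OpenAI judge configuration.
model_config = {
    "azure_deployment": os.getenv("AZURE_DEPLOYMENT"),
    "api_key": os.getenv("AZURE_API_KEY"),
    "azure_endpoint": os.getenv("AZURE_ENDPOINT"),
    "api_version": os.getenv("AZURE_API_VERSION"),
}

relevance = RelevanceEvaluator(model_config=model_config)

# Single-turn query and response pair.
result = relevance(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
)
print(result)
```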
