
Commit d5116b0

Merge pull request #7230 from MicrosoftDocs/main
Auto Publish – main to live - 2025-09-23 17:13 UTC
2 parents 4233006 + 3db78c4 commit d5116b0

31 files changed: +664 / -550 lines

articles/ai-foundry/concepts/evaluation-evaluators/azure-openai-graders.md

Lines changed: 65 additions & 46 deletions
@@ -5,23 +5,30 @@ description: Learn about Azure OpenAI Graders for evaluating AI model outputs, i
 author: lgayhardt
 ms.author: lagayhar
 ms.reviewer: mithigpe
-ms.date: 07/16/2025
+ms.date: 09/23/2025
 ms.service: azure-ai-foundry
 ms.topic: reference
 ms.custom:
 - build-aifnd
 - build-2025
 ---
 
-# Azure OpenAI Graders (preview)
+# Azure OpenAI graders (preview)
 
-[!INCLUDE [feature-preview](../../includes/feature-preview.md)]
+Azure OpenAI graders are a new set of evaluation tools in the Azure AI Foundry SDK that evaluate the performance of AI models and their outputs. These graders include:
+
+- [Label grader](#label-grader)
+- [String checker](#string-checker)
+- [Text similarity](#text-similarity)
+- [Python grader](#python-grader)
 
-The Azure OpenAI Graders are a new set of evaluation graders available in the Azure AI Foundry SDK, aimed at evaluating the performance of AI models and their outputs. These graders including [Label grader](#label-grader), [String checker](#string-checker), [Text similarity](#text-similarity), and [General grader](#general-grader) can be run locally or remotely. Each grader serves a specific purpose in assessing different aspects of AI model/model outputs.
+You can run graders locally or remotely. Each grader assesses specific aspects of AI models and their outputs.
+
+[!INCLUDE [feature-preview](../../includes/feature-preview.md)]
 
 ## Model configuration for AI-assisted grader
 
-For reference in the following code snippet, the AI-assisted grader uses a model configuration as follows:
+The following code snippet shows the model configuration used by the AI-assisted grader:
 
 ```python
 import os
@@ -33,7 +40,7 @@ model_config = AzureOpenAIModelConfiguration(
     azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
     api_key=os.environ.get("AZURE_API_KEY"),
     azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
-    api_version=os.environ.get("AZURE_API_VERSION"),
+    api_version=os.environ.get("AZURE_API_VERSION")
 )
 ```
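
The configuration above reads four environment variables. As a quick sanity check before constructing it, here's a minimal sketch; the variable names come from the snippet itself, while the check is our own illustration, not part of the article:

```python
import os

# The four variables the model configuration snippet expects (illustrative check).
required = ["AZURE_ENDPOINT", "AZURE_API_KEY", "AZURE_DEPLOYMENT_NAME", "AZURE_API_VERSION"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Set these environment variables first: {', '.join(missing)}")
```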

@@ -42,11 +49,11 @@
 `AzureOpenAILabelGrader` uses your custom prompt to instruct a model to classify outputs based on labels you define. It returns structured results with explanations for why each label was chosen.
 
 > [!NOTE]
-> We recommend using Azure OpenAI GPT o3-mini for best results.
+> We recommend using Azure OpenAI o3-mini for the best results.
 
-Here's an example `data.jsonl` that is used in the following code snippets:
+Here's an example of `data.jsonl` used in the following code snippets:
 
-```jsonl
+```json
 [
     {
         "query": "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",
@@ -83,11 +90,11 @@ from azure.ai.evaluation import AzureOpenAILabelGrader, evaluate
 
 data_file_name="data.jsonl"
 
-# Evaluation criteria: Determine if the response column contains texts that are "too short", "just right", or "too long" and pass if it is "just right"
+# Evaluation criteria: Determine if the response column contains text that is "too short," "just right," or "too long," and pass if it is "just right."
 label_grader = AzureOpenAILabelGrader(
     model_config=model_config,
-    input=[{"content": "{{item.response}}", "role": "user"}
-           {"content":"Any text including space that's more than 600 characters are too long, less than 500 characters are too short; 500 to 600 characters are just right.", "role":"user", "type": "message"}],
+    input=[{"content": "{{item.response}}", "role": "user"},
+           {"content": "Any text including space that's more than 600 characters is too long, less than 500 characters is too short; 500 to 600 characters is just right.", "role": "user", "type": "message"}],
     labels=["too short", "just right", "too long"],
     passing_labels=["just right"],
     model="gpt-4o",
@@ -104,7 +111,7 @@ label_grader_evaluation = evaluate(
 
 ### Label grader output
 
-For each of the sets of sample data contained in the data file, an evaluation result of `True` or `False` is returned signifying if the output matches with the passing label defined. The `score` is `1.0` for `True` cases while `score` is `0.0` for `False` cases. The reason for why the model provided the label for the data can be found in `content` under `outputs.label.sample`.
+For each set of sample data in the data file, an evaluation result of `True` or `False` is returned, signifying if the output matches the defined passing label. The `score` is `1.0` for `True` cases, and `0.0` for `False` cases. The reason the model provided the label for the data is in `content` under `outputs.label.sample`.
 
 ```python
 'outputs.label.sample':
@@ -114,12 +121,11 @@ For each of the sets of sample data contained in the data file, an evaluation re
 'content': '{"steps":[{"description":"Calculate the number of characters in the user\'s input including spaces.","conclusion":"The provided text contains 575 characters."},{"description":"Evaluate if the character count falls within the given ranges (greater than 600 too long, less than 500 too short, 500 to 600 just right).","conclusion":"The character count falls between 500 and 600, categorized as \'just right.\'"}],"result":"just right"}'}],
 ...
 ...
-'outputs.label.label_result': 'pass',
 'outputs.label.passed': True,
 'outputs.label.score': 1.0
 ```
 
-Aside from individual data evaluation results, the grader also returns a metric indicating the overall dataset pass rate.
+In addition to individual data evaluation results, the grader returns a metric indicating the overall dataset pass rate.
 
 ```python
 'metrics': {'label.pass_rate': 0.2}, # 1/5 in this case
@@ -139,7 +145,7 @@ string_grader = AzureOpenAIStringCheckGrader(
     model_config=model_config,
     input="{{item.query}}",
     name="starts with what is",
-    operation="like", # "eq" for equal, "ne" for not equal, "like" for contain, "ilike" for case insensitive contain
+    operation="like", # "eq" for equal, "ne" for not equal, "like" for contains, "ilike" for case-insensitive contains
     reference="What is",
 )
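
For readers skimming the diff, the `operation` comment above maps to simple string predicates. A plain-Python sketch of that reading (illustrative only; this is not the service's implementation):

```python
def string_check(operation: str, text: str, reference: str) -> bool:
    """Illustrative reading of the four operations; not the service code."""
    if operation == "eq":      # exact match
        return text == reference
    if operation == "ne":      # not equal
        return text != reference
    if operation == "like":    # contains, case-sensitive
        return reference in text
    if operation == "ilike":   # contains, case-insensitive
        return reference.lower() in text.lower()
    raise ValueError(f"unknown operation: {operation}")

# The grader above passes queries containing "What is":
print(string_check("like", "What is a deductible?", "What is"))  # True
```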

@@ -153,23 +159,22 @@ string_grader_evaluation = evaluate(
 
 ### String checker output
 
-For each of the sets of sample data contained in the data file, an evaluation result of `True` or `False` is returned signifying if the input text matches with pattern matching rules defined. The `score` is `1.0` for `True` cases while `score` is `0.0` for `False` cases.
+For each set of sample data in the data file, an evaluation result of `True` or `False` is returned, indicating whether the input text matches the defined pattern-matching rules. The `score` is `1.0` for `True` cases while `score` is `0.0` for `False` cases.
 
 ```python
-'outputs.string.string_result': 'pass',
 'outputs.string.passed': True,
 'outputs.string.score': 1.0
 ```
 
 The grader also returns a metric indicating the overall dataset pass rate.
 
 ```python
-'metrics': {'string.pass_rate': 0.4}, #2/5 in this case
+'metrics': {'string.pass_rate': 0.4}, # 2/5 in this case
 ```
 
 ## Text similarity
 
-Evaluates how closely input text matches a reference value using similarity metrics like`fuzzy_match`, `BLEU`, `ROUGE`, or `METEOR`. Useful for assessing text quality or semantic closeness.
+Evaluates how closely input text matches a reference value using similarity metrics like `fuzzy_match`, `BLEU`, `ROUGE`, or `METEOR`. This is useful for assessing text quality or semantic closeness.
 
 ### Text similarity example
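
The example code under this heading is unchanged by this commit, so the hunk below elides it. As a sketch of the shape such a grader takes, assuming the `AzureOpenAITextSimilarityGrader` class and parameter names from the `azure-ai-evaluation` preview package (this diff doesn't itself confirm them):

```python
from azure.ai.evaluation import AzureOpenAITextSimilarityGrader, evaluate

# Sketch only: the metric and threshold values are illustrative.
sim_grader = AzureOpenAITextSimilarityGrader(
    model_config=model_config,
    evaluation_metric="fuzzy_match",  # e.g. "bleu", "meteor", or a ROUGE variant
    input="{{item.response}}",
    name="similarity",
    pass_threshold=0.5,
    reference="{{item.ground_truth}}",
)

sim_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={"similarity": sim_grader},
)
```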

@@ -197,69 +202,83 @@ sim_grader_evaluation
 
 ### Text similarity output
 
-For each set of sample data contained in the data file, a numerical similarity score is generated. This score, ranging from 0 to 1, indicates the degree of similarity, with higher scores representing greater similarity. Additionally, an evaluation result of `True` or `False` is returned, signifying whether the similarity score meets or exceeds the specified threshold based on the evaluation metric defined in the grader.
+For each set of sample data in the data file, a numerical similarity score is generated. This score ranges from 0 to 1 and indicates the degree of similarity, with higher scores representing greater similarity. An evaluation result of `True` or `False` is also returned, signifying whether the similarity score meets or exceeds the specified threshold based on the evaluation metric defined in the grader.
 
 ```python
-'outputs.similarity.similarity_result': 'pass',
 'outputs.similarity.passed': True,
 'outputs.similarity.score': 0.6117136659436009
 ```
 
 The grader also returns a metric indicating the overall dataset pass rate.
 
 ```python
-'metrics': {'similarity.pass_rate': 0.4}, #2/5 in this case
+'metrics': {'similarity.pass_rate': 0.4}, # 2 out of 5 in this case
 ```
 
-## General grader
+## Python grader
 
-Advanced users have the capability to import or define a custom grader and integrate it into the AOAI general grader. This allows for evaluations to be performed based on specific areas of interest aside from the existing AOAI graders. Following is an example to import the OpenAI `StringCheckGrader` and construct it to be ran as a AOAI general grader on Foundry SDK.
+Advanced users can create or import custom Python grader functions and integrate them into the Azure OpenAI Python grader. This enables evaluations tailored to specific areas of interest beyond the capabilities of the existing Azure OpenAI graders. The following example demonstrates how to import a custom similarity grader function and configure it to run as an Azure OpenAI Python grader using the Azure AI Foundry SDK.
 
 ### Example
 
 ```python
-from openai.types.graders import StringCheckGrader
-from azure.ai.evaluation import AzureOpenAIGrader
+from azure.ai.evaluation import AzureOpenAIPythonGrader
 
-# Define an string check grader config directly using the OAI SDK
-# Evaluation criteria: Pass if query column contains "Northwind"
-oai_string_check_grader = StringCheckGrader(
-    input="{{item.query}}",
-    name="contains hello",
-    operation="like",
-    reference="Northwind",
-    type="string_check"
-)
-# Plug that into the general grader
-general_grader = AzureOpenAIGrader(
-    model_config=model_config,
-    grader_config=oai_string_check_grader
+python_similarity_grader = AzureOpenAIPythonGrader(
+    model_config=model_config_aoai,
+    name="custom_similarity",
+    image_tag="2025-05-08",
+    pass_threshold=0.3,
+    source="""
+def grade(sample, item) -> float:
+    \"\"\"
+    Custom similarity grader using word overlap.
+    Note: All data is in the 'item' parameter.
+    \"\"\"
+    # Extract from item, not sample!
+    response = item.get("response", "") if isinstance(item, dict) else ""
+    ground_truth = item.get("ground_truth", "") if isinstance(item, dict) else ""
+
+    # Simple word overlap similarity
+    response_words = set(response.lower().split())
+    truth_words = set(ground_truth.lower().split())
+
+    if not truth_words:
+        return 0.0
+
+    overlap = response_words.intersection(truth_words)
+    similarity = len(overlap) / len(truth_words)
+
+    return min(1.0, similarity)
+""",
 )
+
+file_name = "eval_this.jsonl"
 evaluation = evaluate(
     data=data_file_name,
     evaluators={
-        "general": general_grader,
+        "custom_similarity": python_similarity_grader,
     },
+    #azure_ai_project=azure_ai_project,
 )
 evaluation
 ```
 
 ### Output
 
-For each set of sample data contained in the data file, general grader returns a numerical score that is a 0-1 float and a higher score is better. Given a numerical threshold defined as part of the custom grader, we also output `True` if the score >= threshold, or `False` otherwise.
+For each set of sample data in the data file, the Python grader returns a numerical score based on the defined function. Given a numerical threshold defined as part of the custom grader, we also output `True` if the score >= threshold, or `False` otherwise.
 
 For example:
 
 ```python
-'outputs.general.general_result': 'pass',
-'outputs.general.passed': True,
-'outputs.general.score': 1.0
+"outputs.custom_similarity.passed": false,
+"outputs.custom_similarity.score": 0.0
 ```
 
 Aside from individual data evaluation results, the grader also returns a metric indicating the overall dataset pass rate.
 
 ```python
-'metrics': {'general.pass_rate': 0.4}, #2/5 in this case
+'metrics': {'custom_similarity.pass_rate': 0.0}, # 0/5 in this case
 ```
 
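All four graders plug into the same `evaluate()` loop, so they can also run together. A hedged sketch of a combined run, reusing the grader variables from the snippets above (including `sim_grader` from the sketch earlier; the combined call itself doesn't appear in this diff):

```python
from azure.ai.evaluation import evaluate

# Sketch: one run over the same data file with all four graders defined earlier.
combined = evaluate(
    data=data_file_name,
    evaluators={
        "label": label_grader,
        "string": string_grader,
        "similarity": sim_grader,
        "custom_similarity": python_similarity_grader,
    },
)
```
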
 ## Related content

articles/ai-foundry/foundry-local/concepts/foundry-local-architecture.md

Lines changed: 19 additions & 10 deletions
@@ -93,6 +93,12 @@ The hardware abstraction layer ensures that Foundry Local can run on various dev
 - **multiple _execution providers_**, such as NVIDIA CUDA, AMD, Qualcomm, Intel.
 - **multiple _device types_**, such as CPU, GPU, NPU.
 
+> [!NOTE]
+> For Intel NPU support on Windows, you need to install the [Intel NPU driver](https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html) to enable hardware acceleration.
+
+> [!NOTE]
+> For Qualcomm NPU support, you need to install the [Qualcomm NPU driver](https://softwarecenter.qualcomm.com/catalog/item/QHND). If you encounter the error `Qnn error code 5005: "Failed to load from EpContext model. qnn_backend_manager."`, this typically indicates an outdated driver or NPU resource conflicts. Try rebooting to clear NPU resource conflicts, especially after using Windows Copilot+ features.
+
 ### Developer experiences
 
 The Foundry Local architecture is designed to provide a seamless developer experience, enabling easy integration and interaction with AI models.
@@ -123,19 +129,22 @@ Foundry Local supports integration with various SDKs in most languages, such as
 The AI Toolkit for Visual Studio Code provides a user-friendly interface for developers to interact with Foundry Local. It allows users to run models, manage the local cache, and visualize results directly within the IDE.
 
 **Features**:
-- Model management: Download, load, and run models from within the IDE.
-- Interactive console: Send requests and view responses in real-time.
-- Visualization tools: Graphical representation of model performance and results.
+
+- Model management: Download, load, and run models from within the IDE.
+- Interactive console: Send requests and view responses in real time.
+- Visualization tools: Graphical representation of model performance and results.
 
 **Prerequisites:**
-- You have installed [Foundry Local](../get-started.md) and have a model service running.
-- You have installed the [AI Toolkit for Visual Studio Code](https://marketplace.visualstudio.com/items?itemName=ms-windows-ai-studio.windows-ai-studio) extension.
-
+
+- You have installed [Foundry Local](../get-started.md) and have a model service running.
+- You have installed the [AI Toolkit for Visual Studio Code](https://marketplace.visualstudio.com/items?itemName=ms-windows-ai-studio.windows-ai-studio) extension.
+
 **Connect Foundry Local model to AI Toolkit:**
-1. **Add model in AI Toolkit**: Open AI Toolkit from the activity bar of Visual Studio Code. In the 'My Models' panel, click the 'Add model for remote interface' button and then select 'Add a custom model' from the dropdown menu.
-2. **Enter the chat compatible endpoint URL**: Enter `http://localhost:PORT/v1/chat/completions` where PORT is replaced with the port number of your Foundry Local service endpoint. You can see the port of your locally running service using the CLI command `foundry service status`. Foundry Local dynamically assigns a port, so it might not always the same.
-3. **Provide model name**: Enter the exact model name you which to use from Foundry Local, for example `phi-3.5-mini`. You can list all previously downloaded and locally cached models using the CLI command `foundry cache list` or use `foundry model list` to see all available models for local use. You’ll also be asked to enter a display name, which is only for your own local use, so to avoid confusion it’s recommended to enter the same name as the exact model name.
-4. **Authentication**: If your local setup doesn't require authentication *(which is the default for a Foundry Local setup)*, you can leave the authentication headers field blank and press Enter.
+
+1. **Add model in AI Toolkit**: Open AI Toolkit from the activity bar of Visual Studio Code. In the 'My Models' panel, select the 'Add model for remote interface' button and then select 'Add a custom model' from the dropdown menu.
+2. **Enter the chat-compatible endpoint URL**: Enter `http://localhost:PORT/v1/chat/completions` where PORT is replaced with the port number of your Foundry Local service endpoint. You can see the port of your locally running service using the CLI command `foundry service status`. Foundry Local dynamically assigns a port, so it might not always be the same.
+3. **Provide model name**: Enter the exact model name you wish to use from Foundry Local, for example `phi-3.5-mini`. You can list all previously downloaded and locally cached models using the CLI command `foundry cache list`, or use `foundry model list` to see all available models for local use. You’ll also be asked to enter a display name, which is only for your own local use, so to avoid confusion it’s recommended to enter the same name as the exact model name.
+4. **Authentication**: If your local setup doesn't require authentication _(which is the default for a Foundry Local setup)_, you can leave the authentication headers field blank and press Enter.
 
 After completing these steps, your Foundry Local model will appear in the 'My Models' list in AI Toolkit and is ready to be used by right-clicking your model and selecting 'Load in Playground'.
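
Once connected, the same chat-compatible endpoint can be exercised directly from code. A sketch using the OpenAI Python client; the port (5273) and model name are placeholders, so read yours from `foundry service status` and `foundry cache list`:

```python
from openai import OpenAI

# The port is assigned dynamically; 5273 is a placeholder. Check `foundry service status`.
client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="phi-3.5-mini",  # the exact model name from `foundry cache list`
    messages=[{"role": "user", "content": "Say hello from Foundry Local."}],
)
print(response.choices[0].message.content)
```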
