articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md (9 additions, 9 deletions)
@@ -189,7 +189,7 @@ The result of the AI-assisted quality evaluators for a query and response pair i
 To further improve intelligibility, all evaluators accept a binary threshold (unless they output already binary outputs) and output two new keys. For the binarization threshold, a default is set and user can override it. The two new keys are:

 - `{metric_name}_result` a "pass" or "fail" string based on a binarization threshold.
-- `{metric_name}_threshold` a numerical binarization threshold set by default or by the user
+- `{metric_name}_threshold` a numerical binarization threshold set by default or by the user.
 - `additional_details` contains debugging information about the quality of a single agent run.

 ```json
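To make the two added keys concrete, here is a minimal sketch of the kind of result dictionary an AI-assisted evaluator could return once binarization is applied. The metric name `intent_resolution`, the sample values, and the contents of `additional_details` are illustrative assumptions, not values taken from this diff.

```python
# Illustrative only: the shape of an evaluator result after binarization.
# Metric name and values are hypothetical sample data.
result = {
    "intent_resolution": 4.0,                # raw 1-5 score from the evaluator
    "intent_resolution_result": "pass",      # "pass" because 4.0 >= the threshold
    "intent_resolution_threshold": 3,        # default threshold, overridable by the user
    "additional_details": {                  # debugging info about the agent run
        "actual_user_intent": "book a flight to Paris",
        "conversation_has_intent": True,
    },
}

# The pass/fail key is the score compared against the binarization threshold.
assert result["intent_resolution_result"] == (
    "pass"
    if result["intent_resolution"] >= result["intent_resolution_threshold"]
    else "fail"
)
```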
@@ -238,7 +238,7 @@ from azure.ai.evaluation import AIAgentConverter
 # Initialize the converter
 converter = AIAgentConverter(project_client)

-#specify a file path to save agent output (which is evaluation input data)
+#Specify a file path to save agent output (which is evaluation input data)
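The hunk above only shows the converter being initialized and the comment about saving its output; the following is a hedged sketch of how that file might be produced. The `prepare_evaluation_data` method name, and the `project_client` and `thread_id` variables, are assumed from setup steps that are not part of this diff.

```python
import os

from azure.ai.evaluation import AIAgentConverter

# project_client and thread_id are assumed to exist from earlier setup (not in this diff).
converter = AIAgentConverter(project_client)

# Specify a file path to save agent output (which is evaluation input data).
file_name = os.path.join(os.getcwd(), "evaluation_input_data.jsonl")

# Assumed helper: converts the agent thread(s) into evaluator-ready rows
# and writes them to the JSONL file above.
evaluation_data = converter.prepare_evaluation_data(thread_ids=thread_id, filename=file_name)
```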
@@ -303,7 +303,7 @@ Following the URI, you will be redirected to Foundry to view your evaluation res
 With Azure AI Evaluation SDK client library, you can seamlessly evaluate your Azure AI agents via our converter support, which enables observability and transparency into agentic workflows.


-## Evaluators with agent message support
+## Evaluating other agents

 For agents outside of Azure AI Agent Service, you can still evaluate them by preparing the right data for the evaluators of your choice.

@@ -328,7 +328,7 @@ We'll demonstrate some examples of the two data formats: simple agent data, and
 As with other [built-in AI-assisted quality evaluators](./evaluate-sdk.md#performance-and-quality-evaluators), `IntentResolutionEvaluator` and `TaskAdherenceEvaluator` output a likert score (integer 1-5; higher score is better). `ToolCallAccuracyEvaluator` outputs the passing rate of all tool calls made (a float between 0-1) based on user query. To further improve intelligibility, all evaluators accept a binary threshold and output two new keys. For the binarization threshold, a default is set and user can override it. The two new keys are:

 - `{metric_name}_result` a "pass" or "fail" string based on a binarization threshold.
-- `{metric_name}_threshold` a numerical binarization threshold set by default or by the user
+- `{metric_name}_threshold` a numerical binarization threshold set by default or by the user.
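The paragraph in the hunk above says the default binarization threshold can be overridden; below is a minimal sketch of doing so, assuming the evaluator accepts a `threshold` keyword argument and that `model_config` was built earlier (neither appears in this diff).

```python
from azure.ai.evaluation import IntentResolutionEvaluator

# model_config is assumed to be a model configuration created earlier (not in this diff).
# The threshold keyword is an assumption: scores at or above it map to "pass".
intent_resolution = IntentResolutionEvaluator(model_config=model_config, threshold=4)

result = intent_resolution(
    query="What are the opening hours of the Eiffel Tower?",
    response="The Eiffel Tower is open from 9:00 AM to 11:00 PM.",
)
print(result["intent_resolution_result"], result["intent_resolution_threshold"])
```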
articles/ai-foundry/how-to/develop/cloud-evaluation.md (5 additions, 5 deletions)
@@ -103,9 +103,9 @@ or contain conversation data like this:
 }
 ```

-For more details on input data formats, refer to [single-turn data](./evaluate-sdk.md#single-turn-support-for-text), [conversation data](./evaluate-sdk.md#conversation-support-for-text), and [conversation data for images and multi-modalities](./evaluate-sdk.md#conversation-support-for-images-and-multi-modal-text-and-image).
+To learn more about input data formats for evaluating GenAI applications, see [single-turn data](./evaluate-sdk.md#single-turn-support-for-text), [conversation data](./evaluate-sdk.md#conversation-support-for-text), and [conversation data for images and multi-modalities](./evaluate-sdk.md#conversation-support-for-images-and-multi-modal-text-and-image).

-For agent evaluation, refer to [evaluator support for agent messages](./agent-evaluate-sdk.md#evaluators-with-agent-message-support).
+To learn more about input data formats for evaluating agents, see [evaluating Azure AI agents](./agent-evaluate-sdk.md#evaluate-azure-ai-agents) and [evaluating other agents](./agent-evaluate-sdk.md/#evaluating-other-agents).


 We provide two ways to register your data in Azure AI project required for evaluations in the cloud:
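The context line above introduces the two ways to register data for cloud evaluation; as a hedged sketch of the upload route, the snippet below assumes the preview `AIProjectClient.upload_file` helper and an existing `project_client`, both of which may differ by SDK version.

```python
# Assumed preview API: uploads a local JSONL dataset to the Azure AI project
# and returns an ID that cloud evaluations can reference.
data_id, _ = project_client.upload_file("./evaluation_input_data.jsonl")
print(data_id)
```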
@@ … @@ After logging your custom evaluator to your Azure AI project, you can view it in your [Evaluator library](../evaluate-generative-ai-app.md#view-and-manage-the-evaluators-in-the-evaluator-library) under **Evaluation** tab of your Azure AI project.

-## Cloud evaluation (preview) with Azure AI Projects SDK
+## Submit a cloud evaluation

-Putting the above altogether, you can now submit a cloud evaluation with Azure AI Projects SDK via a Python API. See the following example specifying an NLP evaluator (F1 score), AI-assisted quality and safety evaluator (Relevance and Violence), and a custom evaluator (Friendliness) with their [evaluator IDs](#specifying-evaluators-from-evaluator-library):
+Putting the previous code altogether, you can now submit a cloud evaluation with Azure AI Projects SDK client library via a Python API. See the following example specifying an NLP evaluator (F1 score), AI-assisted quality and safety evaluator (Relevance and Violence), and a custom evaluator (Friendliness) with their [evaluator IDs](#specifying-evaluators-from-evaluator-library):
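The paragraph added in the hunk above points at evaluator IDs for F1 score, Relevance, Violence, and a custom Friendliness evaluator; the sketch below shows roughly how such a submission could look with the preview Azure AI Projects client library. The `Evaluation`, `Dataset`, and `EvaluatorConfiguration` model names, the `init_params` contents, the `data_id` variable, and the friendliness evaluator ID are assumptions that may differ by SDK version.

```python
from azure.ai.projects.models import Evaluation, Dataset, EvaluatorConfiguration
from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator

# project_client and data_id are assumed from earlier setup and data upload steps.
evaluation = Evaluation(
    display_name="Cloud evaluation",
    data=Dataset(id=data_id),
    evaluators={
        # Built-in evaluators referenced by their evaluator IDs.
        "f1_score": EvaluatorConfiguration(id=F1ScoreEvaluator.id),
        "relevance": EvaluatorConfiguration(
            id=RelevanceEvaluator.id,
            init_params={"deployment_name": "gpt-4o"},  # hypothetical deployment name
        ),
        "violence": EvaluatorConfiguration(
            id=ViolenceEvaluator.id,
            init_params={"azure_ai_project": project_client.scope},
        ),
        # Custom evaluator registered in the evaluator library (placeholder ID).
        "friendliness": EvaluatorConfiguration(id="<friendliness-evaluator-id>"),
    },
)

evaluation_response = project_client.evaluations.create(evaluation=evaluation)
print(evaluation_response.id)
```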
articles/ai-foundry/how-to/develop/evaluate-sdk.md (2 additions, 2 deletions)
@@ -161,7 +161,7 @@ conversation = {

 ```

-To run batch evaluations using [local evaluation](#local-evaluation-on-test-datasets-using-evaluate) or [upload your dataset to run cloud evaluation](./cloud-evaluation.md#uploading-evaluation-data), you will need to represent the dataset in `.jsonl` format. The above conversation is equivalent to a line of dataset as following in a `.jsonl` file:
+To run batch evaluations using [local evaluation](#local-evaluation-on-test-datasets-using-evaluate) or [upload your dataset to run cloud evaluation](./cloud-evaluation.md#uploading-evaluation-data), you will need to represent the dataset in `.jsonl` format. The previous conversation is equivalent to a line of dataset as following in a `.jsonl` file:

 ```json
 {"conversation":
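As a small illustration of the `.jsonl` representation described above, the snippet below writes a conversation dictionary as a single line of a JSONL file; the file name and the trimmed-down conversation content are made up for the example.

```python
import json

# A trimmed-down conversation object; a real one would carry the full message content.
conversation = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris is the capital of France."},
    ]
}

# Each line of a .jsonl dataset is one standalone JSON object.
with open("evaluation_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps({"conversation": conversation}) + "\n")
```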
@@ -441,7 +441,7 @@ The result of the AI-assisted quality evaluators for a query and response pair i
 To further improve intelligibility, all evaluators accept a binary threshold (unless they output already binary outputs) and output two new keys. For the binarization threshold, a default is set and user can override it. The two new keys are:

 - `{metric_name}_result` a "pass" or "fail" string based on a binarization threshold.
-- `{metric_name}_threshold` a numerical binarization threshold set by default or by the user
+- `{metric_name}_threshold` a numerical binarization threshold set by default or by the user.