articles/ai-foundry/how-to/benchmark-model-in-catalog.md (1 addition, 1 deletion)
@@ -107,7 +107,7 @@ To access benchmark results for a specific metric and dataset:
 The previous sections showed the benchmark results calculated by Microsoft, using public datasets. However, you can try to regenerate the same set of metrics with your data.

 1. Return to the **Benchmarks** tab in the model card.
-1. Select **Try with your own data** to [evaluate the model with your data](evaluate-generative-ai-app.md#fine-tuned-model-evaluation). Evaluation on your data helps you see how the model performs in your particular scenarios.
+1. Select **Try with your own data** to [evaluate the model with your data](evaluate-generative-ai-app.md#model-evaluation). Evaluation on your data helps you see how the model performs in your particular scenarios.

 :::image type="content" source="../media/how-to/model-benchmarks/try-with-your-own-data.png" alt-text="Screenshot showing the button to select for evaluating with your own data." lightbox="../media/how-to/model-benchmarks/try-with-your-own-data.png":::
articles/ai-foundry/how-to/evaluate-generative-ai-app.md

@@ -32,8 +32,6 @@ An evaluation run allows you to generate metric outputs for each data row in you

 From the collapsible left menu, select **Evaluation** > **Create a new evaluation**.

-:::image type="content" source="../media/evaluations/evaluate/create-new-evaluation.png" alt-text="Screenshot of the button to create a new evaluation." lightbox="../media/evaluations/evaluate/create-new-evaluation.png":::
-
 ### From the model catalog page

 1. From the collapsible left menu, select **Model catalog**.
@@ -47,11 +45,9 @@ From the collapsible left menu, select **Evaluation** > **Create a new evaluatio

 When you start an evaluation from the **Evaluate** page, you first need to choose the evaluation target. By specifying the appropriate evaluation target, we can tailor the evaluation to the specific nature of your application, ensuring accurate and relevant metrics. We support two types of evaluation targets:

-- **Fine-tuned model**: This choice evaluates the output generated by your selected model and user-defined prompt.
+- **Model**: This choice evaluates the output generated by your selected model and user-defined prompt.
 - **Dataset**: Your model-generated outputs are already in a test dataset.

-:::image type="content" source="../media/evaluations/evaluate/select-evaluation-target.png" alt-text="Screenshot of the evaluation target selection." lightbox="../media/evaluations/evaluate/select-evaluation-target.png":::
-
 #### Configure test data

 When you enter the evaluation creation wizard, you can select from preexisting datasets or upload a new dataset to evaluate. The test dataset needs to have the model-generated outputs to be used for evaluation. A preview of your test data is shown on the right pane.
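For context on this requirement, here is a minimal sketch of a test dataset that already contains model-generated outputs, written as JSON Lines. The column names (`query`, `context`, `response`, `ground_truth`) are illustrative assumptions, not names required by the article; map them to your chosen metrics in the data mapping step.

```python
import json

# Illustrative rows for a "Dataset" evaluation target: each record already
# contains the model-generated output alongside the input and a reference answer.
# The field names are assumptions for this sketch; map them to your metrics
# in the data mapping step of the wizard.
rows = [
    {
        "query": "What is the capital of France?",
        "context": "France is a country in Western Europe. Its capital is Paris.",
        "response": "The capital of France is Paris.",
        "ground_truth": "Paris",
    },
    {
        "query": "Who wrote 'Pride and Prejudice'?",
        "context": "Pride and Prejudice is an 1813 novel by Jane Austen.",
        "response": "Jane Austen wrote 'Pride and Prejudice'.",
        "ground_truth": "Jane Austen",
    },
]

# Write the records as JSON Lines (.jsonl), one JSON object per line,
# so the file can be uploaded as a test dataset.
with open("test_data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```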
@@ -72,8 +68,6 @@ We support three types of metrics curated by Microsoft to facilitate a comprehen
 - **AI quality (NLP)**: These natural language processing (NLP) metrics are mathematical-based, and they also evaluate the overall quality of the generated content. They often require ground truth data, but they don't require a model deployment as judge.
 - **Risk and safety metrics**: These metrics focus on identifying potential content risks and ensuring the safety of the generated content.

-:::image type="content" source="../media/evaluations/evaluate/testing-criteria.png" alt-text="Screenshot that shows how to add testing criteria." lightbox="../media/evaluations/evaluate/testing-criteria.png":::
-
 As you add your testing criteria, different metrics are going to be used as part of the evaluation. You can refer to the table for the complete list of metrics we offer support for in each scenario. For more in-depth information on metric definitions and how they're calculated, see [What are evaluators?](../concepts/observability.md#what-are-evaluators).

 | AI quality (AI assisted) | AI quality (NLP) | Risk and safety metrics |
@@ -143,15 +137,11 @@ For guidance on the specific data mapping requirements for each metric, refer to

 After you complete all the necessary configurations, you can provide an optional name for your evaluation. Then you can review and select **Submit** to submit the evaluation run.

-:::image type="content" source="../media/evaluations/evaluate/review-and-finish.png" alt-text="Screenshot that shows the review page to create a new evaluation." lightbox="../media/evaluations/evaluate/review-and-finish.png":::
-
-### Fine-tuned model evaluation
+### Model evaluation

 To create a new evaluation for your selected model deployment, you can use a GPT model to generate sample questions, or you can select from your established dataset collection.

-:::image type="content" source="../media/evaluations/evaluate/select-data-source.png" alt-text="Screenshot that shows how to select a data source in Create a new evaluation." lightbox="../media/evaluations/evaluate/select-data-source.png":::
-
-#### Configure test data for a fine-tuned model
+#### Configure test data for a model

 Set up the test dataset that's used for evaluation. This dataset is sent to the model to generate responses for assessment. You have two options for configuring your test data:
@@ -176,8 +166,6 @@ To configure your test criteria, select **Next**. As you select your criteria, m

 After you select the test criteria you want, you can review the evaluation, optionally change the name of the evaluation, and then select **Submit**. Go to the evaluation page to see the results.

-:::image type="content" source="../media/evaluations/evaluate/review-model-evaluation.png" alt-text="Screenshot that shows the Review evaluation option." lightbox="../media/evaluations/evaluate/review-model-evaluation.png":::
-
 > [!NOTE]
 > The generated dataset is saved to the project’s blob storage after the evaluation run is created.
@@ -189,8 +177,6 @@ The evaluator library also enables version management. You can compare different

 To use the evaluator library in Azure AI Foundry portal, go to your project's **Evaluation** page and select the **Evaluator library** tab.

-:::image type="content" source="../media/evaluations/evaluate/evaluator-library-list.png" alt-text="Screenshot that shows the page where you select evaluators from the evaluator library." lightbox="../media/evaluations/evaluate/evaluator-library-list.png":::
-
 You can select the evaluator name to see more details. You can see the name, description, and parameters, and check any files associated with the evaluator. Here are some examples of Microsoft-curated evaluators:

 - For performance and quality evaluators curated by Microsoft, you can view the annotation prompt on the details page. You can adapt these prompts to your own use case. Change the parameters or criteria according to your data and objectives in the Azure AI Evaluation SDK. For example, you can select **Groundedness-Evaluator** and check the Prompty file that shows how we calculate the metric.
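To illustrate the adaptation path this paragraph describes, here is a minimal sketch that calls the Microsoft-curated groundedness evaluator from the `azure-ai-evaluation` package. The endpoint, key, and deployment names are placeholders, and the exact keys in the returned dictionary can vary by SDK version.

```python
from azure.ai.evaluation import GroundednessEvaluator

# Placeholder Azure OpenAI connection details for the judge model; replace
# these with your own resource values (or load them from environment variables).
model_config = {
    "azure_endpoint": "https://<your-aoai-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-gpt-deployment>",
}

# The AI-assisted groundedness evaluator uses the deployed model as a judge.
groundedness = GroundednessEvaluator(model_config)

result = groundedness(
    query="What can I do in the Azure AI Foundry portal?",
    context="The Azure AI Foundry portal lets you evaluate, deploy, and monitor generative AI models.",
    response="You can evaluate, deploy, and monitor generative AI models.",
)

# The result is a dictionary that includes a groundedness score and reasoning.
print(result)
```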
articles/ai-foundry/how-to/evaluate-results.md (2 additions, 3 deletions)
@@ -8,7 +8,7 @@ ms.custom:
 - build-2024
 - ignite-2024
 ms.topic: how-to
-ms.date: 09/15/2025
+ms.date: 09/22/2025
 ms.reviewer: mithigpe
 ms.author: lagayhar
 author: lgayhardt
@@ -33,8 +33,6 @@ In this article, you learn how to:

 After you submit an evaluation, locate the run on the **Evaluation** page. Filter or adjust columns to focus on runs of interest. Review high‑level metrics at a glance before drilling in.

-:::image type="content" source="../media/evaluations/view-results/evaluation-run-list.png" alt-text="Screenshot that shows the evaluation run list." lightbox="../media/evaluations/view-results/evaluation-run-list.png":::
-
 > [!TIP]
 > You can view an evaluation run with any version of the `promptflow-evals` SDK or `azure-ai-evaluation` versions 1.0.0b1, 1.0.0b2, 1.0.0b3. Enable the **Show all runs** toggle to locate the run.
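As background for this tip, evaluation runs submitted from code appear in this list when they're logged to the project. The sketch below uses the `azure-ai-evaluation` package under stated assumptions: the project dictionary values and the `data.jsonl` column names are placeholders, and newer SDK versions may accept a project endpoint URL instead of the dictionary shown here.

```python
from azure.ai.evaluation import RelevanceEvaluator, evaluate

# Judge-model configuration for the AI-assisted evaluator (placeholder values).
model_config = {
    "azure_endpoint": "https://<your-aoai-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-gpt-deployment>",
}

# Project reference (placeholder values) so the run is logged to the project
# and shows up in the evaluation run list in the portal.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

result = evaluate(
    data="data.jsonl",  # JSON Lines file with query/context/response columns (assumed names)
    evaluators={"relevance": RelevanceEvaluator(model_config)},
    azure_ai_project=azure_ai_project,
    output_path="./evaluation_results.json",  # local copy of the per-row results
)

print(result["studio_url"])  # link to the run in the portal (if logging succeeded)
```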
@@ -64,6 +62,7 @@ In the **Metric dashboard** section, aggregate views are broken down by metrics
 Use the table under the dashboard to inspect each data sample. Sort by a metric to surface worst‑performing samples and identify systematic gaps (incorrect results, safety failures, latency). Use search to cluster related failure topics. Apply column customization to focus on key metrics.

 Typical actions:
+
 - Filter for low scores to detect recurring patterns.
 - Adjust prompts or fine-tune when systemic gaps appear.
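To complement the in-portal table review described above, the per-row results saved by an SDK run (for example, the `evaluation_results.json` written by `output_path` in the earlier sketch) can be filtered offline. This is a hedged sketch that assumes the file has a top-level `rows` list with `inputs.*` and `outputs.<evaluator>.<metric>` columns; adjust the names to match your actual run.

```python
import json

import pandas as pd

# Load the per-row results saved by an SDK evaluation run (assumed layout:
# a top-level "rows" list where each row holds "inputs.*" and "outputs.*" keys).
with open("evaluation_results.json", encoding="utf-8") as f:
    results = json.load(f)

df = pd.DataFrame(results["rows"])

# Column name is an assumption for this sketch; adjust it to the evaluator
# and metric you actually ran (for example "outputs.relevance.relevance").
score_column = "outputs.relevance.relevance"

# Surface the worst-performing samples to look for recurring failure patterns.
low_scores = df[df[score_column] <= 2].sort_values(score_column)
print(low_scores[["inputs.query", score_column]].head(10))
```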