articles/ai-foundry/concepts/model-benchmarks.md
Lines changed: 42 additions & 3 deletions
@@ -20,14 +20,14 @@ author: lgayhardt
20
20
21
21
Model leaderboards (preview) in Azure AI Foundry portal allow you to streamline the model selection process in the Azure AI Foundry [model catalog](../how-to/model-catalog-overview.md). The model leaderboards, backed by industry-standard benchmarks, can help you find the best model for your custom AI solution. From the model leaderboards section of the model catalog, you can [browse leaderboards](https://aka.ms/model-leaderboards) to compare available models as follows:
22
22
23
-
-[Quality, cost, and performance leaderboards](../how-to/benchmark-model-in-catalog.md#access-model-leaderboards) to quickly identify the model leaders along a single metric (quality, cost, or throughput);
23
+
-[Quality, safety, cost, and performance leaderboards](../how-to/benchmark-model-in-catalog.md#access-model-leaderboards) to quickly identify the model leaders along a single metric (quality, safety, cost, or throughput);
24
24
-[Trade-off charts](../how-to/benchmark-model-in-catalog.md#compare-models-in-the-trade-off-charts) to see how models perform on one metric versus another, such as quality versus cost;
25
25
-[Leaderboards by scenario](../how-to/benchmark-model-in-catalog.md#view-leaderboards-by-scenario) to find the best leaderboards that suit your scenario.
26
26
27
27
Whenever you find a model to your liking, you can select it and zoom into the **Detailed benchmarking results** of the model within the model catalog. If satisfied with the model, you can deploy it, try it in the playground, or evaluate it on your data. The leaderboards support benchmarking across text language models (large language models (LLMs) and small language models (SLMs)) and embedding models.
28
28
29
29
30
-
Model benchmarks assess LLMs and SLMs across the following categories: quality, performance, and cost. In addition, we assess the quality of embedding models using standard benchmarks. The leaderboards are updated regularly as better and more unsaturated benchmarks are onboarded, and as new models are added to the model catalog.
30
+
Model benchmarks assess LLMs and SLMs across the following categories: quality, safety, cost, and throughput. In addition, we assess the quality of embedding models using standard benchmarks. The leaderboards are updated regularly as better and more unsaturated benchmarks are onboarded, and as new models are added to the model catalog.
31
31
32
32
33
33
## Quality benchmarks of language models
@@ -40,7 +40,7 @@ Azure AI assesses the quality of LLMs and SLMs using accuracy scores from standa
40
40
41
41
Quality index is provided on a scale of zero to one. Higher values of quality index are better. The datasets included in quality index are:
42
42
43
-
| Dataset Name | Leaderboard Category|
43
+
| Dataset Name | Leaderboard Scenario|
44
44
|--------------------|----------------------|
45
45
| arena_hard | QA |
46
46
| bigbench_hard | Reasoning |
@@ -62,6 +62,45 @@ See more details in accuracy scores:
62
62
Accuracy scores are provided on a scale of zero to one. Higher values are better.
63
63
64
64
65
+
## Safety benchmarks of language models
66
+
67
+
To guide the selection of safety benchmarks for evaluation, we apply a structured filtering and validation process designed to ensure both relevance and rigor. A benchmark qualifies for onboarding if it addresses high-priority risks. For safety leaderboards, we look at different benchmarks that can be considered reliable enough to provide some signals on certain topics of interest as they relate to safety. We select [HarmBench](https://github.com/centerforaisafety/HarmBench) to proxy model safety, and organize scenario leaderboards as follows:

| Dataset Name | Leaderboard Scenario | Metric | Interpretation |
|--------------------|----------------------|----------------------|----------------------|
| HarmBench (standard) | Standard harmful behaviors | Attack Success Rate | Lower values mean better robustness against attacks designed to elicit standard harmful content |
72
+
| HarmBench (contextual) | Contextually harmful behaviors | Attack Success Rate | Lower values mean better robustness against attacks designed to elicit contextually harmful content |
73
+
| HarmBench (copyright violations) | Copyright violations | Attack Success Rate | Lower values mean better robustness against attacks designed to elicit copyright violations |
74
+
| WMDP | Knowledge in sensitive domains | Accuracy | Higher values denote more knowledge in sensitive domains (cybersecurity, biosecurity, and chemical security) |
75
+
| Toxigen | Ability to detect toxic content | F1 Score | Higher values mean better ability to detect toxic content |
76
+
77
+
### Model harmful behaviors
78
+
The [HarmBench](https://github.com/centerforaisafety/HarmBench) benchmark measures harmful model behaviors and includes prompts designed to elicit harmful behavior from a model. As it relates to safety, the benchmark covers 7 semantic categories of behavior:
79
+
- Cybercrime & Unauthorized Intrusion
80
+
- Chemical & Biological Weapons/Drugs
81
+
- Copyright Violations
82
+
- Misinformation & Disinformation
83
+
- Harassment & Bullying
84
+
- Illegal Activities
85
+
- General Harm
86
+
87
+
These 7 categories can be grouped into 3 functional categories:
88
+
- standard harmful behaviors
89
+
- contextually harmful behaviors
90
+
- copyright violations
91
+
92
+
Each functional category is featured in a separate scenario leaderboard. We use direct prompts from HarmBench (no attacks) and HarmBench evaluators to calculate Attack Success Rate (ASR). Lower ASR values mean safer models. We do not explore any attack strategy for evaluation, and model benchmarking is performed with the Azure AI Content Safety filter turned off.
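
As a rough illustration of the metric (not the HarmBench evaluator itself), ASR can be read as the fraction of prompts whose response the evaluator judged harmful. The function name and the sample verdicts below are hypothetical and exist only for this sketch.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Return the fraction of prompts whose response was judged harmful.

    Each entry is True when the evaluator flagged the model response as
    harmful (a "successful attack") and False otherwise.
    """
    if not judgements:
        raise ValueError("at least one judgement is required")
    # True counts as 1 and False as 0, so the sum is the number of harmful responses.
    return sum(judgements) / len(judgements)


# Hypothetical evaluator verdicts for one functional category.
standard_verdicts = [False, False, True, False]
print(attack_success_rate(standard_verdicts))  # 0.25
```

Under this reading, a model that refuses every prompt in a category scores 0.0, and each functional category would get its own score.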
93
+
94
+
95
+
### Model ability to detect toxic content
96
+
[Toxigen](https://github.com/microsoft/TOXIGEN) is a large-scale machine-generated dataset for adversarial and implicit hate speech detection. It contains implicitly toxic and benign sentences mentioning 13 minority groups. We use the annotated samples from Toxigen for evaluation and calculate F1 scores to measure classification performance. Scoring higher on this dataset means that a model is better at detecting toxic content. Model benchmarking is performed with the Azure AI Content Safety filter turned off.
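
For readers less familiar with the metric, the sketch below computes a binary F1 score with "toxic" as the positive class. Treating the task as binary with label 1 for toxic is an assumption of this example, since the text above doesn't state the exact averaging used; the labels and predictions are made up.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 with 1 = toxic (positive class) and 0 = benign."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, missed
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    # Equivalent to 2 * precision * recall / (precision + recall).
    return 2 * tp / (2 * tp + fp + fn)


# Made-up annotations and model predictions.
labels = [1, 1, 0, 0]
predictions = [1, 0, 1, 0]
print(f1_score(labels, predictions))  # 0.5
```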
97
+
98
+
### Model knowledge in sensitive domains
99
+
The [Weapons of Mass Destruction Proxy](https://github.com/centerforaisafety/wmdp) (WMDP) benchmark measures model knowledge in sensitive domains including biosecurity, cybersecurity, and chemical security. The leaderboard uses average accuracy scores across cybersecurity, biosecurity, and chemical security. A higher WMDP accuracy score denotes more knowledge of dangerous capabilities (worse behavior from a safety standpoint). Model benchmarking is performed with the default Azure AI Content Safety filters on. These safety filters detect and block content harm in violence, self-harm, sexual, hate and unfairness, but don't target categories in cybersecurity, biosecurity, and chemical security.
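
A minimal sketch of the averaging step, assuming an unweighted mean over the three domains (the text above says "average" without specifying weights). The dictionary keys and the accuracy values are illustrative only.

```python
from statistics import mean

# Domain keys are hypothetical names chosen for this sketch.
WMDP_DOMAINS = ("cybersecurity", "biosecurity", "chemical_security")


def wmdp_score(domain_accuracy: dict[str, float]) -> float:
    """Average per-domain accuracy, assuming equal weight per domain."""
    missing = [d for d in WMDP_DOMAINS if d not in domain_accuracy]
    if missing:
        raise KeyError(f"missing domain accuracies: {missing}")
    return mean(domain_accuracy[d] for d in WMDP_DOMAINS)


# Made-up accuracies on a zero-to-one scale.
example = {"cybersecurity": 0.5, "biosecurity": 0.25, "chemical_security": 0.75}
print(wmdp_score(example))  # 0.5
```

Unlike the quality leaderboards, a higher value here is the less desirable outcome from a safety standpoint.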
100
+
101
+
### Limitations of safety benchmarks
102
+
We understand and acknowledge that safety is a complex topic and has several dimensions. No single current open-source benchmark can test or represent the full safety of a system in different scenarios. Additionally, most of these benchmarks suffer from saturation or from misalignment between benchmark design and the risk definition, and they can lack clear documentation on how the target risks are conceptualized and operationalized, making it difficult to assess whether the benchmark accurately captures the nuances of the risks. This limitation can lead to either overestimating or underestimating model performance in real-world safety scenarios.
103
+
65
104
## Performance benchmarks of language models
66
105
67
106
Performance metrics are calculated as an aggregate over 14 days, based on 24 trials (two requests per trial) sent daily with a one-hour interval between every trial. The following default parameters are used for each request to the model endpoint:
articles/ai-foundry/concepts/model-lifecycle-retirement.md
Lines changed: 16 additions & 0 deletions
@@ -123,6 +123,22 @@ The following tables list the timelines for models that are on track for retirem
123
123
|[Meta-Llama-3-8B-Instruct](https://ai.azure.com/explore/models/Meta-Llama-3-8B-Instruct/version/9/registry/azureml-meta)| February 28, 2025 | March 31, 2025 | June 30, 2025 |[Meta-Llama-3.1-8B-Instruct](https://ai.azure.com/explore/models/Meta-Llama-3.1-8B-Instruct/version/4/registry/azureml-meta)|
124
124
|[Meta-Llama-3.1-70B-Instruct](https://ai.azure.com/explore/models/Meta-Llama-3.1-70B-Instruct/version/4/registry/azureml-meta)| February 28, 2025 | March 31, 2025 | June 30, 2025 |[Llama-3.3-70B-Instruct](https://ai.azure.com/explore/models/Llama-3.3-70B-Instruct/version/4/registry/azureml-meta)|
125
125
126
+
127
+
#### Microsoft
128
+
129
+
| Model | Legacy date (UTC) | Deprecation date (UTC) | Retirement date (UTC) | Suggested replacement model |
|---|---|---|---|---|
|[Phi-3-medium-4k-instruct](https://ai.azure.com/explore/models/Phi-3-medium-4k-instruct/version/6/registry/azureml)| June 9, 2025 | June 30, 2025 | August 30, 2025 |[Phi-4](https://ai.azure.com/explore/models/Phi-4/version/8/registry/azureml)|
132
+
|[Phi-3-medium-128k-instruct](https://ai.azure.com/explore/models/Phi-3-medium-128k-instruct/version/7/registry/azureml)| June 9, 2025 | June 30, 2025 | August 30, 2025 |[Phi-4](https://ai.azure.com/explore/models/Phi-4/version/8/registry/azureml)|
133
+
|[Phi-3-mini-4k-instruct](https://ai.azure.com/explore/models/Phi-3-mini-4k-instruct/version/15/registry/azureml)| June 9, 2025 | June 30, 2025 | August 30, 2025 |[Phi-4-mini-instruct](https://ai.azure.com/explore/models/Phi-4-mini-instruct/version/1/registry/azureml)|
134
+
|[Phi-3-mini-128k-instruct](https://ai.azure.com/explore/models/Phi-3-mini-128k-instruct/version/13/registry/azureml)| June 9, 2025 | June 30, 2025 | August 30, 2025 |[Phi-4-mini-instruct](https://ai.azure.com/explore/models/Phi-4-mini-instruct/version/1/registry/azureml)|
135
+
|[Phi-3-small-8k-instruct](https://ai.azure.com/explore/models/Phi-3-small-8k-instruct/version/6/registry/azureml)| June 9, 2025 | June 30, 2025 | August 30, 2025 |[Phi-4-mini-instruct](https://ai.azure.com/explore/models/Phi-4-mini-instruct/version/1/registry/azureml)|
136
+
|[Phi-3-small-128k-instruct](https://ai.azure.com/explore/models/Phi-3-small-128k-instruct/version/5/registry/azureml)| June 9, 2025 | June 30, 2025 | August 30, 2025 |[Phi-4-mini-instruct](https://ai.azure.com/explore/models/Phi-4-mini-instruct/version/1/registry/azureml)|
137
+
|[Phi-3.5-mini-instruct](https://ai.azure.com/explore/models/Phi-3.5-mini-instruct/version/6/registry/azureml)| June 9, 2025 | June 30, 2025 | August 30, 2025 |[Phi-4-mini-instruct](https://ai.azure.com/explore/models/Phi-4-mini-instruct/version/1/registry/azureml)|
138
+
|[Phi-3.5-MoE-instruct](https://ai.azure.com/explore/models/Phi-3.5-MoE-instruct/version/5/registry/azureml)| June 9, 2025 | June 30, 2025 | August 30, 2025 |[Phi-4-mini-instruct](https://ai.azure.com/explore/models/Phi-4-mini-instruct/version/1/registry/azureml)|
139
+
|[Phi-3.5-vision-instruct](https://ai.azure.com/explore/models/Phi-3.5-vision-instruct/version/2/registry/azureml)| June 9, 2025 | June 30, 2025 | August 30, 2025 |[Phi-4-mini-instruct](https://ai.azure.com/explore/models/Phi-4-mini-instruct/version/1/registry/azureml)|
140
+
141
+
126
142
#### Mistral AI
127
143
128
144
| Model | Legacy date (UTC) | Deprecation date (UTC) | Retirement date (UTC) | Suggested replacement model |
articles/ai-foundry/how-to/benchmark-model-in-catalog.md
Lines changed: 4 additions & 4 deletions
@@ -20,8 +20,8 @@ author: lgayhardt
20
20
21
21
In this article, you learn to streamline your model selection process in the Azure AI Foundry [model catalog](../how-to/model-catalog-overview.md) by comparing models in the model leaderboards (preview) available in Azure AI Foundry portal. This comparison can help you make informed decisions about which models meet the requirements for your particular use case or application. You can compare models by viewing the following leaderboards:
22
22
23
-
-[Quality, cost, and performance leaderboards](#access-model-leaderboards) to quickly identify the model leaders along a single metric (quality, cost, or throughput);
24
-
-[Trade-off charts](#compare-models-in-the-trade-off-charts) to see how models perform on one metric versus another, such as quality versus cost;
23
+
-[Quality, safety, cost, and performance leaderboards](#access-model-leaderboards) to quickly identify the model leaders along a single criterion (quality, safety, cost, or throughput);
24
+
-[Trade-off charts](#compare-models-in-the-trade-off-charts) to see how models perform on one metric versus another, such as quality versus cost, among different selection criteria;
25
25
-[Leaderboards by scenario](#view-leaderboards-by-scenario) to find the best leaderboards that suit your scenario.
26
26
27
27
## Prerequisites
@@ -42,7 +42,7 @@ In this article, you learn to streamline your model selection process in the Azu
42
42
43
43
:::image type="content" source="../media/how-to/model-benchmarks/leaderboard-entry.png" alt-text="Screenshot showing the entry point from model catalog into model leaderboards." lightbox="../media/how-to/model-benchmarks/leaderboard-entry.png":::
44
44
45
-
The homepage displays leaderboard highlights for model selection criteria. Quality is the most common criterion for model selection, followed by cost and performance.
45
+
The homepage displays leaderboard highlights for model selection criteria. Quality is the most common criterion for model selection, followed by safety, cost, and performance.
46
46
47
47
:::image type="content" source="../media/how-to/model-benchmarks/leaderboard-highlights.png" alt-text="Screenshot showing the highlighted leaderboards in quality, cost, and performance." lightbox="../media/how-to/model-benchmarks/leaderboard-highlights.png":::
48
48
@@ -52,7 +52,7 @@ In this article, you learn to streamline your model selection process in the Azu
52
52
Trade-off charts allow you to compare models based on the criteria that you care more about. Suppose you care more about cost than quality and you discover that the highest-quality model isn't the cheapest model. In that case, you might need to make trade-offs among quality, cost, and performance criteria. In the trade-off charts, you can compare how models perform along two metrics at a glance.
53
53
54
54
1. Select the **Models selected** dropdown menu to add or remove models from the trade-off chart.
55
-
1. Select the **Quality vs. Throughput** tab and the **Throughput vs Cost** tab to view those charts for your selected models.
55
+
1. Select the **Quality vs. Cost** tab and the **Quality vs Throughput** tab to view those charts for your selected models.
56
56
1. Select **Compare between metrics** to access more detailed results for each model.
57
57
58
58
:::image type="content" source="../media/how-to/model-benchmarks/leaderboard-trade-off.png" alt-text="Screenshot showing the trade-off charts in quality, cost, and performance." lightbox="../media/how-to/model-benchmarks/leaderboard-trade-off.png":::
articles/ai-foundry/how-to/prompt-flow-troubleshoot.md
Lines changed: 2 additions & 6 deletions
@@ -67,16 +67,12 @@ You might encounter a 429 error from Azure OpenAI. This error means that you rea
67
67
68
68
If you see the message `request canceled` in the logs, it might be because the OpenAI API call is taking too long and exceeding the time-out limit.
69
69
70
-
A network issue or a complex request that requires more processing time might cause the OpenAI time out. For more information, see [OpenAI API time out](https://help.openai.com/en/articles/6897186-timeout).
71
-
70
+
A network issue or a complex request that requires more processing time might cause the OpenAI time out.
71
+
72
72
Wait a few seconds and retry your request. This action usually resolves any network issues.
73
73
74
74
If retrying doesn't work, check whether you're using a long context model, such as `gpt-4-32k`, and have set a large value for `max_tokens`. If so, the behavior is expected because your prompt might generate a long response that takes longer than the interactive mode's upper threshold. In this situation, we recommend that you try `Bulk test` because this mode doesn't have a time-out setting.
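
The wait-and-retry advice above can be sketched in client code as follows. This is a generic pattern, not a prompt flow API: `call_with_retry` and `flaky_request` are hypothetical names, and `TimeoutError` stands in for whatever exception your client raises when a request is canceled.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    request_fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request_fn, waiting and retrying when it raises TimeoutError."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the time-out to the caller
            # Wait a few seconds, doubling the pause after each failure.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Stand-in for a model call that times out twice before succeeding.
calls = {"count": 0}


def flaky_request() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("request canceled")
    return "ok"


waits: list[float] = []
result = call_with_retry(flaky_request, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)  # ok [2.0, 4.0]
```

Retrying only helps with transient network issues. A request that times out because of a long context and a large `max_tokens` value will likely keep timing out, which is why the bulk test route is suggested for that case.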
75
75
76
-
1. If you can't find anything in logs to indicate that it's a specific node issue:
77
-
78
-
- Contact the prompt flow team ([promptflow-eng](mailto:[email protected])) with the logs. We try to identify the root cause.
79
-
80
76
## Compute session failures that use a custom base image: Flow deployment-related issues
81
77
82
78
### How do I resolve an upstream request time-out issue?