Commit b042005

Update model-benchmarks.md
1 parent 40f1b82 commit b042005


articles/ai-foundry/concepts/model-benchmarks.md

Lines changed: 16 additions & 11 deletions
@@ -20,14 +20,14 @@ author: lgayhardt

Model leaderboards (preview) in Azure AI Foundry portal allow you to streamline the model selection process in the Azure AI Foundry [model catalog](../how-to/model-catalog-overview.md). The model leaderboards, backed by industry-standard benchmarks, can help you find the best model for your custom AI solution. From the model leaderboards section of the model catalog, you can [browse leaderboards](https://aka.ms/model-leaderboards) to compare available models as follows:

-- [Quality, cost, and performance leaderboards](../how-to/benchmark-model-in-catalog.md#access-model-leaderboards) to quickly identify the model leaders along a single metric (quality, cost, or throughput);
+- [Quality, safety, cost, and performance leaderboards](../how-to/benchmark-model-in-catalog.md#access-model-leaderboards) to quickly identify the model leaders along a single metric (quality, safety, cost, or throughput);
- [Trade-off charts](../how-to/benchmark-model-in-catalog.md#compare-models-in-the-trade-off-charts) to see how models perform on one metric versus another, such as quality versus cost;
- [Leaderboards by scenario](../how-to/benchmark-model-in-catalog.md#view-leaderboards-by-scenario) to find the best leaderboards that suit your scenario.

Whenever you find a model to your liking, you can select it and zoom into the **Detailed benchmarking results** of the model within the model catalog. If satisfied with the model, you can deploy it, try it in the playground, or evaluate it on your data. The leaderboards support benchmarking across text language models (large language models (LLMs) and small language models (SLMs)) and embedding models.


-Model benchmarks assess LLMs and SLMs across the following categories: quality, performance, and cost. In addition, we assess the quality of embedding models using standard benchmarks. The leaderboards are updated regularly as better and more unsaturated benchmarks are onboarded, and as new models are added to the model catalog.
+Model benchmarks assess LLMs and SLMs across the following categories: quality, safety, cost, and throughput. In addition, we assess the quality of embedding models using standard benchmarks. The leaderboards are updated regularly as better and more unsaturated benchmarks are onboarded, and as new models are added to the model catalog.


## Quality benchmarks of language models
@@ -66,13 +66,13 @@ Accuracy scores are provided on a scale of zero to one. Higher values are better

To guide the selection of safety benchmarks for evaluation, we apply a structured filtering and validation process designed to ensure both relevance and rigor. A benchmark qualifies for onboarding if it addresses high-priority risks. For safety leaderboards, we look at benchmarks that are reliable enough to provide signals on topics of interest as they relate to safety. We select [HarmBench](https://github.com/centerforaisafety/HarmBench) to proxy model safety, and organize scenario leaderboards as follows:

-| Dataset Name | Leaderboard Scenario |
-|--------------------|----------------------|
-| HarmBench (standard) | Standard harmful behaviors |
-| HarmBench (contextual) | Contextually harmful behaviors |
-| HarmBench (copyright violations) | Copyright violations |
-| WMDP | Knowledge in sensitive domains |
-| Toxigen | Ability to detect toxic content |
+| Dataset Name | Leaderboard Scenario | Metric | Interpretation |
+|--------------------|----------------------|----------------------|----------------------|
+| HarmBench (standard) | Standard harmful behaviors | Attack Success Rate | Lower values mean better robustness against attacks designed to elicit standard harmful content |
+| HarmBench (contextual) | Contextually harmful behaviors | Attack Success Rate | Lower values mean better robustness against attacks designed to elicit contextually harmful content |
+| HarmBench (copyright violations) | Copyright violations | Attack Success Rate | Lower values mean better robustness against attacks designed to elicit copyright violations |
+| WMDP | Knowledge in sensitive domains | Accuracy | Higher values denote more knowledge in sensitive domains (cybersecurity, biosecurity, and chemical security) |
+| Toxigen | Ability to detect toxic content | Accuracy | Higher values mean better ability to detect toxic content |
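For illustration, here's a minimal sketch of how per-model scores could be ordered into a scenario leaderboard given each metric's direction from the table above; the model names, scores, and helper are hypothetical, not the portal's implementation:

```python
# Hypothetical sketch: model names and scores are illustrative only.
LOWER_IS_BETTER = {"Attack Success Rate"}  # accuracy-based scenarios sort descending instead


def rank_models(scores: dict[str, float], metric: str) -> list[str]:
    """Order models best-first for one leaderboard scenario."""
    descending = metric not in LOWER_IS_BETTER
    return sorted(scores, key=scores.get, reverse=descending)


# Example: an ASR scenario, where lower is better.
print(rank_models({"model-a": 0.12, "model-b": 0.05}, "Attack Success Rate"))  # ['model-b', 'model-a']
```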

### Model harmful behaviors
The [HarmBench](https://github.com/centerforaisafety/HarmBench) benchmark measures model harmful behaviors and includes prompts designed to elicit harmful behavior from models. As it relates to safety, the benchmark covers 7 semantic categories of behavior:
@@ -84,14 +84,19 @@ The [HarmBench](https://github.com/centerforaisafety/HarmBench) benchmark measur
- Illegal Activities
- General Harm

-These 7 categories can be summarized into 3 functional categories capturing standard harmful behaviors, contextually harmful behaviors, and copyright violations, which we surface in 3 separate scenario leaderboards. We use direct prompts from HarmBench (no attacks) and HarmBench evaluators to calculate Attack Success Rate (ASR). Lower ASR values means safer models. We do not explore any attack strategy for evaluation, and model benchmarking is performed with Azure AI Content Safety Filter turned off.
+These 7 categories can be summarized into 3 functional categories:
+- standard harmful behaviors
+- contextually harmful behaviors
+- copyright violations
+
+Each functional category is featured in a separate scenario leaderboard. We use direct prompts from HarmBench (no attacks) and HarmBench evaluators to calculate Attack Success Rate (ASR). Lower ASR values mean safer models. We do not explore any attack strategy for evaluation, and model benchmarking is performed with Azure AI Content Safety Filter turned off.
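As a rough illustration of the ASR arithmetic described above, the sketch below counts the fraction of direct prompts whose responses a judge flags as harmful; `generate` and `judge_is_harmful` are hypothetical stand-ins, not the actual HarmBench evaluator interface:

```python
from typing import Callable, Sequence


def attack_success_rate(
    prompts: Sequence[str],
    generate: Callable[[str], str],                 # hypothetical: model under test
    judge_is_harmful: Callable[[str, str], bool],   # hypothetical: evaluator verdict per (prompt, response)
) -> float:
    """Fraction of direct prompts that yield a response judged harmful. Lower is safer."""
    if not prompts:
        return 0.0
    hits = sum(judge_is_harmful(p, generate(p)) for p in prompts)
    return hits / len(prompts)
```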


### Model ability to detect toxic content
[Toxigen](https://github.com/microsoft/TOXIGEN) is a large-scale machine-generated dataset for adversarial and implicit hate speech detection. It contains implicitly toxic and benign sentences mentioning 13 minority groups. We use the annotated samples from Toxigen for evaluation and calculate accuracy scores to measure performance. Higher accuracy is better, because a high score on this dataset indicates a better ability to detect toxic content. Model benchmarking is performed with Azure AI Content Safety Filter turned off.
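The accuracy score here is simply the share of annotated samples where the model's toxic-versus-benign call matches the human label; a sketch, with `model_says_toxic` as a hypothetical wrapper around the model being benchmarked:

```python
from typing import Callable, Sequence, Tuple


def detection_accuracy(
    samples: Sequence[Tuple[str, bool]],       # (sentence, annotated-as-toxic) pairs
    model_says_toxic: Callable[[str], bool],   # hypothetical model wrapper
) -> float:
    """Share of samples where the model agrees with the annotation. Higher is better."""
    correct = sum(model_says_toxic(text) == label for text, label in samples)
    return correct / len(samples)
```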

### Model knowledge in sensitive domains
-The [Weapons of Mass Destruction Proxy](https://github.com/centerforaisafety/wmdp) (WMDP) benchmark measures model knowledge of in sensitive domains including biosecurity, cybersecurity, and chemical security. The leaderboard uses average accuracy across cybersecurity, biosecurity, and chemical security. A higher WMDP accuracy score denotes more knowledge of dangerous capabilities (worse behavior from a safety standpoint). Model benchmarking is performed with the default Azure AI Content Safety filters on. These safety filters detect and block content harm in violence, self-harm, sexual, hate and unfairness, but don't target categories in cybersecurity, biosecurity, chemical security.
+The [Weapons of Mass Destruction Proxy](https://github.com/centerforaisafety/wmdp) (WMDP) benchmark measures model knowledge in sensitive domains including biosecurity, cybersecurity, and chemical security. The leaderboard uses average accuracy across cybersecurity, biosecurity, and chemical security. A higher WMDP accuracy score denotes more knowledge of dangerous capabilities (worse behavior from a safety standpoint). Model benchmarking is performed with the default Azure AI Content Safety filters on. These safety filters detect and block content harm in violence, self-harm, sexual, and hate and unfairness, but don't target categories in cybersecurity, biosecurity, and chemical security.
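The leaderboard number is just the unweighted average of the three per-domain accuracies; a small sketch, assuming the per-domain scores are already computed (the values shown are made up):

```python
WMDP_DOMAINS = ("cybersecurity", "biosecurity", "chemical_security")


def wmdp_leaderboard_score(per_domain_accuracy: dict[str, float]) -> float:
    """Average accuracy across the three WMDP domains.

    Note the inverted safety reading: higher means more knowledge of
    dangerous capabilities, which is worse from a safety standpoint.
    """
    return sum(per_domain_accuracy[d] for d in WMDP_DOMAINS) / len(WMDP_DOMAINS)


# Example with made-up numbers:
print(wmdp_leaderboard_score({"cybersecurity": 0.41, "biosecurity": 0.37, "chemical_security": 0.33}))  # ~0.37
```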

### Limitations of safety benchmarks
We understand and acknowledge that safety is a complex topic with several dimensions. No single current open-source benchmark can test or represent the full safety of a system in different scenarios. Additionally, most of these benchmarks suffer from saturation or from misalignment between benchmark design and the risk definition, and they can lack clear documentation on how the target risks are conceptualized and operationalized. This makes it difficult to assess whether a benchmark accurately captures the nuances of the risks, and it can lead to either overestimating or underestimating model performance in real-world safety scenarios.
