Commit 8f38e86

authored
Update model-benchmarks.md
added safety leaderboards
1 parent a5b1bd0 commit 8f38e86

File tree

1 file changed (+35 lines, -1 line)


articles/ai-foundry/concepts/model-benchmarks.md

Lines changed: 35 additions & 1 deletion
@@ -40,7 +40,7 @@ Azure AI assesses the quality of LLMs and SLMs using accuracy scores from standa
Quality index is provided on a scale of zero to one. Higher values of quality index are better. The datasets included in quality index are:

- | Dataset Name | Leaderboard Category |
+ | Dataset Name | Leaderboard Scenario |
|--------------------|----------------------|
| arena_hard | QA |
| bigbench_hard | Reasoning |
@@ -62,6 +62,40 @@ See more details in accuracy scores:
Accuracy scores are provided on a scale of zero to one. Higher values are better.

## Safety benchmarks of language models

To guide the selection of safety benchmarks for evaluation, we apply a structured filtering and validation process designed to ensure both relevance and rigor. A benchmark qualifies for onboarding if it addresses a high-priority risk and is considered reliable enough to provide meaningful signal on the safety topics of interest. We select [HarmBench](https://github.com/centerforaisafety/HarmBench) as a proxy for model safety, and organize the scenario leaderboards as follows:

| Dataset Name | Leaderboard Scenario |
|--------------------|----------------------|
| HarmBench (standard) | Standard harmful behaviors |
| HarmBench (contextual) | Contextually harmful behaviors |
| HarmBench (copyright violations) | Copyright violations |
| WMDP | Knowledge in sensitive domains |
| Toxigen | Ability to detect toxic content |

### Model harmful behaviors

The [HarmBench](https://github.com/centerforaisafety/HarmBench) benchmark measures model harmful behaviors and includes prompts designed to elicit harmful behavior from the model. As it relates to safety, the benchmark covers seven semantic categories of behavior:
- Cybercrime & Unauthorized Intrusion
- Chemical & Biological Weapons/Drugs
- Copyright Violations
- Misinformation & Disinformation
- Harassment & Bullying
- Illegal Activities
- General Harm

These seven categories can be summarized into three functional categories covering standard harmful behaviors, contextually harmful behaviors, and copyright violations, which we surface in three separate scenario leaderboards. We use direct prompts from HarmBench (no attacks) and HarmBench evaluators to calculate the Attack Success Rate (ASR). Lower ASR values mean safer models. We do not apply any attack strategies during evaluation, and model benchmarking is performed with the Azure AI Content Safety filter turned off.
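
For illustration only, ASR can be read as the fraction of direct prompts whose completions a HarmBench-style evaluator flags as harmful. The sketch below is a minimal example under that assumption; `judge` is a hypothetical stand-in for the HarmBench classifier, not part of its actual API.

```python
# Minimal sketch of Attack Success Rate (ASR) aggregation over direct prompts.
# `judge` is a hypothetical placeholder for a HarmBench-style evaluator that
# returns True when a completion exhibits the harmful behavior its prompt targets.
from typing import Callable, Sequence

def attack_success_rate(
    prompts: Sequence[str],
    completions: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of prompts whose completions are judged harmful; lower is safer."""
    harmful = sum(judge(p, c) for p, c in zip(prompts, completions))
    return harmful / len(prompts)
```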
### Model ability to detect toxic content
[Toxigen](https://github.com/microsoft/TOXIGEN) is a large-scale, machine-generated dataset for adversarial and implicit hate speech detection. It contains implicitly toxic and benign sentences mentioning 13 minority groups. We use the annotated samples from Toxigen for evaluation and calculate accuracy scores to measure performance. Higher accuracy is better, because a higher score on this dataset indicates a better ability to detect toxic content. Model benchmarking is performed with the Azure AI Content Safety filter turned off.
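
As a rough sketch (not the exact evaluation harness), the accuracy score for this scenario reduces to the fraction of annotated Toxigen samples the model classifies correctly, assuming simple binary labels (1 = toxic, 0 = benign):

```python
# Minimal sketch: detection accuracy over annotated Toxigen samples.
# `labels` are human annotations, `predictions` are the model's toxicity judgments.
def detection_accuracy(labels: list[int], predictions: list[int]) -> float:
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)

print(detection_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```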
### Model knowledge in sensitive domains
The [Weapons of Mass Destruction Proxy](https://github.com/centerforaisafety/wmdp) (WMDP) benchmark measures model knowledge in sensitive domains, including biosecurity, cybersecurity, and chemical security. The leaderboard uses average accuracy scores across cybersecurity, biosecurity, and chemical security. A higher WMDP accuracy score denotes more knowledge of dangerous capabilities (that is, worse behavior from a safety standpoint). Model benchmarking is performed with the default Azure AI Content Safety filters on. These safety filters detect and block harmful content in the violence, self-harm, sexual, and hate and unfairness categories, but they don't target the cybersecurity, biosecurity, and chemical security categories.
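
To illustrate the aggregation only (the figures below are made up), the leaderboard number can be read as a plain average of the per-domain multiple-choice accuracies:

```python
# Minimal sketch: WMDP leaderboard score as the mean of per-domain accuracies.
# The domain names and values here are illustrative, not real results.
domain_accuracy = {"cybersecurity": 0.41, "biosecurity": 0.38, "chemical_security": 0.35}
wmdp_score = sum(domain_accuracy.values()) / len(domain_accuracy)
print(round(wmdp_score, 2))  # 0.38
```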
### Limitations of safety benchmarks
We understand and acknowledge that safety is a complex topic with several dimensions. No single current open-source benchmark can test or represent the full safety of a system in different scenarios. Additionally, most of these benchmarks suffer from saturation or from misalignment between benchmark design and the risk definition, and they can lack clear documentation on how the target risks are conceptualized and operationalized, making it difficult to assess whether the benchmark accurately captures the nuances of the risks. This limitation can lead to either overestimating or underestimating model performance in real-world safety scenarios.
## Performance benchmarks of language models
Performance metrics are calculated as an aggregate over 14 days, based on 24 trials (two requests per trial) sent daily with a one-hour interval between every trial. The following default parameters are used for each request to the model endpoint:
