articles/ai-foundry/concepts/model-benchmarks.md (3 additions, 3 deletions)
@@ -72,7 +72,7 @@ To guide the selection of safety benchmarks for evaluation, we apply a structure
| HarmBench (contextual) | Contextually harmful behaviors | Attack Success Rate | Lower values mean better robustness against attacks designed to elicit contextually harmful content |
| HarmBench (copyright violations) | Copyright violations | Attack Success Rate | Lower values mean better robustness against attacks designed to elicit copyright violations |
| WMDP | Knowledge in sensitive domains | Accuracy | Higher values denote more knowledge in sensitive domains (cybersecurity, biosecurity, and chemical security) |
-| Toxigen | Ability to detect toxic content | Accuracy | Higher values mean better ability to detect toxic content |
+| Toxigen | Ability to detect toxic content | F1 Score | Higher values mean better ability to detect toxic content |
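The Attack Success Rate entries in the table above are the fraction of adversarial prompts that successfully elicit the targeted behavior, so lower is better. As a minimal sketch of that calculation (the per-prompt outcomes below are hypothetical, and this is not the HarmBench evaluation code):

```python
# Hypothetical per-prompt judgments: True means the attack elicited the targeted behavior.
attack_succeeded = [True, False, False, True, False]

# Attack Success Rate = successful attacks / total attacks; lower is better.
attack_success_rate = sum(attack_succeeded) / len(attack_succeeded)
print(f"Attack Success Rate: {attack_success_rate:.2f}")  # 0.40
```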
### Model harmful behaviors
The [HarmBench](https://github.com/centerforaisafety/HarmBench) benchmark measures model harmful behaviors and includes prompts designed to elicit harmful behavior from models. As it relates to safety, the benchmark covers 7 semantic categories of behavior:
@@ -93,10 +93,10 @@ Each functional category is featured in a separate scenario leaderboard. We use
### Model ability to detect toxic content
-[Toxigen](https://github.com/microsoft/TOXIGEN) is a large-scale machine-generated dataset for adversarial and implicit hate speech detection. It contains implicitly toxic and benign sentences mentioning 13 minority groups. We use the annotated samples from Toxigen for evaluation and calculate accuracy scores to measure performance. Higher accuracy is better, because scoring high on this dataset means that a model is better at detecting toxic content. Model benchmarking is performed with the Azure AI Content Safety Filter turned off.
+[Toxigen](https://github.com/microsoft/TOXIGEN) is a large-scale machine-generated dataset for adversarial and implicit hate speech detection. It contains implicitly toxic and benign sentences mentioning 13 minority groups. We use the annotated samples from Toxigen for evaluation and calculate F1 scores to measure classification performance. Scoring higher on this dataset means that a model is better at detecting toxic content. Model benchmarking is performed with the Azure AI Content Safety Filter turned off.
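For context on the switch from accuracy to F1 in this change, here's a minimal sketch contrasting the two metrics on a small, imbalanced toxic/benign label set, assuming scikit-learn is available; the labels and predictions are invented for illustration and aren't part of the actual evaluation pipeline.

```python
# Illustrative only: accuracy vs. F1 on a toy toxic/benign classification task.
# 1 = toxic, 0 = benign; these labels and predictions are made up for this sketch.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]  # one toxic sentence is missed

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")  # 0.90
print(f"F1 score: {f1_score(y_true, y_pred):.2f}")        # 0.80
```

On imbalanced data like this, accuracy stays high even when a toxic example is missed, while F1 penalizes the missed detection, which is why F1 is commonly preferred for detection tasks.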
### Model knowledge in sensitive domains
-The [Weapons of Mass Destruction Proxy](https://github.com/centerforaisafety/wmdp) (WMDP) benchmark measures model knowledge in sensitive domains, including biosecurity, cybersecurity, and chemical security. The leaderboard uses average accuracy across cybersecurity, biosecurity, and chemical security. A higher WMDP accuracy score denotes more knowledge of dangerous capabilities (worse behavior from a safety standpoint). Model benchmarking is performed with the default Azure AI Content Safety filters on. These safety filters detect and block content harms in violence, self-harm, sexual, and hate and unfairness, but don't target categories in cybersecurity, biosecurity, and chemical security.
+The [Weapons of Mass Destruction Proxy](https://github.com/centerforaisafety/wmdp) (WMDP) benchmark measures model knowledge in sensitive domains, including biosecurity, cybersecurity, and chemical security. The leaderboard uses average accuracy scores across cybersecurity, biosecurity, and chemical security. A higher WMDP accuracy score denotes more knowledge of dangerous capabilities (worse behavior from a safety standpoint). Model benchmarking is performed with the default Azure AI Content Safety filters on. These safety filters detect and block content harms in violence, self-harm, sexual, and hate and unfairness, but don't target categories in cybersecurity, biosecurity, and chemical security.
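As a minimal sketch of the domain averaging described above, assuming hypothetical per-domain accuracy values (this is not the leaderboard's actual implementation):

```python
# Illustrative only: the leaderboard's WMDP score is the average of per-domain accuracy.
# The accuracy values below are hypothetical, not real model results.
domain_accuracy = {
    "cybersecurity": 0.52,
    "biosecurity": 0.47,
    "chemical_security": 0.45,
}

wmdp_score = sum(domain_accuracy.values()) / len(domain_accuracy)
print(f"WMDP average accuracy: {wmdp_score:.2f}")  # 0.48; higher means more dangerous knowledge
```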
### Limitations of safety benchmarks
We understand and acknowledge that safety is a complex topic with several dimensions. No single current open-source benchmark can test or represent the full safety of a system in different scenarios. Additionally, most of these benchmarks suffer from saturation or from misalignment between the benchmark design and the risk definition, and they can lack clear documentation on how the target risks are conceptualized and operationalized, making it difficult to assess whether a benchmark accurately captures the nuances of those risks. This limitation can lead to either overestimating or underestimating model performance in real-world safety scenarios.