Commit 7ec11d6

Chang's edits to concept article
1 parent a47d000 commit 7ec11d6

1 file changed, +23 -6 lines changed

articles/ai-foundry/concepts/model-benchmarks.md

Lines changed: 23 additions & 6 deletions
@@ -7,7 +7,7 @@ ms.service: azure-ai-foundry
 ms.custom:
 - ai-learning-hub
 ms.topic: concept-article
-ms.date: 04/03/2025
+ms.date: 04/04/2025
 ms.reviewer: changliu2
 ms.author: mopeakande
 author: msakande
@@ -38,11 +38,28 @@ Azure AI assesses the quality of LLMs and SLMs using accuracy scores from standa
3838

3939
| Index | Description |
4040
|-------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
41-
| Quality index | Quality index is calculated by averaging applicable accuracy scores (exact_match, pass@1, arena_hard) over 15 standard datasets of applicable accuracy scores. Datasets include BoolQ, HellaSwag, BoolQ, HellaSwag, OpenBookQA, PIQA, Social IQA, Winogrande, SQuAD v2, TruthfulQA (Gen), TruthfulQA (MC), HumanEval, GSM8K, MMLU (Humanities), MMLU (Other), MMLU (Social Sciences), MMLU (STEM). |
42-
43-
Quality index is provided on a scale of zero to one. Higher values of quality index are better.
44-
45-
For accuracy scores:
41+
| Quality index | Quality index is calculated by averaging applicable accuracy scores (exact_match, pass@1, arena_hard) over comprehensive, standard benchmark datasets. |
42+
43+
Quality index is provided on a scale of zero to one. Higher values of quality index are better. The datasets included in quality index are:
44+
45+
| Dataset name | Leaderboard category |
46+
|-------------------------|---------------------|
47+
| BoolQ | QA |
48+
| HellaSwag | Reasoning |
49+
| OpenBookQA | Reasoning |
50+
| PIQA | Reasoning |
51+
| Social IQA | Reasoning |
52+
| Winogrande | Reasoning |
53+
| TruthfulQA (MC) | Groundedness |
54+
| HumanEval | Coding |
55+
| GSM8K | Math |
56+
| MMLU (Humanities) | General Knowledge |
57+
| MMLU (Other) | General Knowledge |
58+
| MMLU (Social Sciences) | General Knowledge |
59+
| MMLU (STEM) | General Knowledge |
60+
61+
62+
See more details in accuracy scores:
4663

4764
| Metric | Description |
4865
|--------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
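
As a rough illustration of the averaging described in the updated Quality index row, the following minimal Python sketch computes an unweighted mean of per-dataset accuracy scores. The function name `quality_index` is hypothetical, the scores are made-up placeholders rather than real benchmark results, and unweighted averaging is an assumption: the article says the scores are averaged but does not say whether datasets are weighted.

```python
from statistics import fmean

# Hypothetical per-dataset accuracy scores on a 0-1 scale.
# Placeholder values for illustration only, not real benchmark results.
example_scores = {
    "BoolQ": 0.80,
    "HellaSwag": 0.70,
    "GSM8K": 0.60,
    "HumanEval": 0.50,
}


def quality_index(scores: dict[str, float]) -> float:
    """Return the unweighted mean of accuracy scores.

    Assumes every score is already normalized to the range [0, 1],
    so the result is also on a zero-to-one scale.
    """
    if not scores:
        raise ValueError("at least one accuracy score is required")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name} is outside [0, 1]: {value}")
    return fmean(scores.values())


print(round(quality_index(example_scores), 2))
```

Because each input score lies between zero and one, the mean stays on the same scale, which matches the statement that higher values of quality index are better.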
