articles/ai-foundry/concepts/model-benchmarks.md

Model leaderboards (preview) in Azure AI Foundry portal allow you to streamline the model selection process in the Azure AI Foundry [model catalog](../how-to/model-catalog-overview.md). The model leaderboards, backed by industry-standard benchmarks, can help you find the best model for your custom AI solution. From the model leaderboards section of the model catalog, you can [browse leaderboards](https://aka.ms/model-leaderboards) to compare available models as follows:

- [Quality, cost, and performance leaderboards](../how-to/benchmark-model-in-catalog.md#access-model-leaderboards) to quickly identify the model leaders along a single metric (quality, cost, or throughput);
- [Trade-off charts](../how-to/benchmark-model-in-catalog.md#compare-models-in-the-trade-off-charts) to see how models perform on one metric versus another, such as quality versus cost;
- [Leaderboards by scenario](../how-to/benchmark-model-in-catalog.md#view-leaderboards-by-scenario) to find the best leaderboards that suit your scenario.

Whenever you find a model to your liking, you can select it and zoom into the **Detailed benchmarking results** of the model within the model catalog. If satisfied with the model, you can deploy it, try it in the playground, or evaluate it on your data. The leaderboards support benchmarking across text language models (large language models (LLMs) and small language models (SLMs)) and embedding models.

Model benchmarks assess LLMs and SLMs across the following categories: quality, performance, and cost. In addition, we assess the quality of embedding models using standard benchmarks. The benchmarks are updated regularly as better and more unsaturated datasets and associated metrics are added to existing models, and as new models are added to the model catalog.
## Quality benchmarks of language models

Azure AI assesses the quality of LLMs and SLMs using accuracy scores from standard, comprehensive benchmark datasets measuring model capabilities such as reasoning, knowledge, question answering, math, and coding.

Accuracy scores are provided on a scale of zero to one. Higher values are better.
## Safety benchmarks of language models

Safety benchmarks use the standard Attack Success Rate (ASR) metric to measure how vulnerable language models are to attacks in biosecurity, cybersecurity, and chemical security. Currently, the [Weapons of Mass Destruction Proxy (WMDP) benchmark](https://www.wmdp.ai/) is used to assess hazardous knowledge in language models. The lower the Attack Success Rate, the safer the model's responses.
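
To make the metric concrete, here's a minimal, hypothetical sketch of how an Attack Success Rate could be computed from per-prompt attack outcomes; it isn't the actual WMDP evaluation harness.

```python
# Hypothetical sketch: Attack Success Rate (ASR) over a set of adversarial prompts.
# Each entry is True if the attack elicited the hazardous content, False otherwise.
attack_outcomes = [True, False, False, True, False, False, False, False]

attack_success_rate = sum(attack_outcomes) / len(attack_outcomes)
print(f"Attack Success Rate: {attack_success_rate:.2%}")  # lower is safer
```
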

All model endpoints are benchmarked with the default Azure AI Content Safety filters turned on in their default configuration. These safety filters detect and block the [content harm categories](../../ai-services/content-safety/concepts/harm-categories.md) of violence, self-harm, sexual content, and hate and unfairness, but they don't measure categories in cybersecurity, biosecurity, and chemical security.
## Performance benchmarks of language models

Performance metrics are calculated as an aggregate over 14 days, based on 24 trials (two requests per trial) sent daily with a one-hour interval between every trial. The following default parameters are used for each request to the model endpoint:

For performance metrics like latency or throughput, the time to first token and the generated tokens per second give a better overall sense of the typical performance and behavior of the model. We refresh our performance numbers on a regular cadence.
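
For illustration only, time to first token and generated tokens per second can be derived from timestamps taken around a streaming request. The following is a hedged sketch of that idea, not the measurement harness Azure AI uses; `stream` stands in for any iterable that yields generated tokens.

```python
import time

def measure_streaming(stream):
    """Return (time_to_first_token_s, generated_tokens_per_second) for a token stream.

    Assumes the stream yields at least one token.
    """
    start = time.perf_counter()
    first_token_at = None
    token_count = 0
    for _token in stream:                       # consume the streamed tokens
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1
    end = time.perf_counter()
    ttft = first_token_at - start               # latency until the first token arrives
    generation_time = max(end - first_token_at, 1e-9)
    return ttft, token_count / generation_time  # token-generation throughput
```
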
## Cost benchmarks of language models
Cost calculations are estimates for using an LLM or SLM model endpoint hosted on the Azure AI platform. Azure AI supports displaying the cost of serverless APIs and Azure OpenAI models. Because these costs are subject to change, we refresh our cost calculations on a regular cadence.
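
As a rough illustration of how such an estimate can be assembled, the sketch below multiplies token counts by per-1,000-token prices; the prices and function are placeholders, not actual Azure pricing or APIs.

```python
# Hypothetical per-1,000-token prices (placeholders, not real Azure rates).
PRICE_PER_1K_INPUT_TOKENS = 0.0005   # USD
PRICE_PER_1K_OUTPUT_TOKENS = 0.0015  # USD

def estimate_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single request from its token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

print(f"Estimated cost per request: ${estimate_request_cost(800, 200):.6f}")
```
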
|-------|-------------|
| Cost index | Estimated cost. Lower values are better. |
## Quality benchmarks of embedding models

The quality index of embedding models is defined as the averaged accuracy scores of a comprehensive set of standard benchmark datasets targeting Information Retrieval, Document Clustering, and Summarization tasks.


See more details in accuracy score definitions specific to each dataset:

| Metric | Description |
|--------|-------------|
| Spearman correlation | Spearman correlation based on cosine similarity is calculated by first computing the cosine similarity between variables, then ranking these scores and using the ranks to compute the Spearman correlation. |
| V measure | V measure is a metric used to evaluate the quality of clustering. V measure is calculated as a harmonic mean of homogeneity and completeness, ensuring a balance between the two for a meaningful score. Possible scores lie between zero and one, with one being perfectly complete labeling. |
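
To make the Spearman correlation and V measure definitions concrete, here's a minimal, hypothetical sketch of how they might be computed with common Python libraries; the embeddings, labels, and human judgments are placeholders, and the real evaluation pipeline differs.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import v_measure_score
from sklearn.metrics.pairwise import cosine_similarity

# Spearman correlation based on cosine similarity: score sentence pairs by the
# cosine similarity of their (placeholder) embeddings, then rank-correlate
# those scores with human similarity judgments.
rng = np.random.default_rng(0)
emb_a, emb_b = rng.random((5, 8)), rng.random((5, 8))   # placeholder embeddings
human_scores = [0.9, 0.2, 0.7, 0.4, 0.8]                # placeholder judgments
cosine_scores = [cosine_similarity([a], [b])[0, 0] for a, b in zip(emb_a, emb_b)]
spearman, _ = spearmanr(cosine_scores, human_scores)

# V measure: harmonic mean of homogeneity and completeness of a clustering.
true_labels = [0, 0, 1, 1, 2]
predicted_clusters = [0, 0, 1, 2, 2]
v_measure = v_measure_score(true_labels, predicted_clusters)

print(f"Spearman (cosine-based): {spearman:.3f}, V measure: {v_measure:.3f}")
```
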
## Calculation of scores
### Individual scores
Benchmark results originate from public datasets that are commonly used for language model evaluation. In most cases, the data is hosted in GitHub repositories maintained by the creators or curators of the data. Azure AI evaluation pipelines download data from their original sources, extract prompts from each example row, generate model responses, and then compute relevant accuracy metrics.
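
As a rough, hypothetical sketch of that flow, the loop below scores a model's exact-match accuracy on one dataset. It uses the Hugging Face `datasets` library as a stand-in for downloading from the original source, an imaginary `model_client` for generation, and illustrative column names; the real pipeline, prompt formats, and metrics differ.

```python
from datasets import load_dataset

def evaluate_accuracy(model_client, dataset_name: str, split: str = "test") -> float:
    """Download a public benchmark, prompt the model per row, and compute accuracy."""
    data = load_dataset(dataset_name, split=split)          # pull the data from its source
    correct = 0
    for row in data:
        prompt = row["question"]                            # extract the prompt (illustrative column name)
        response = model_client.generate(prompt)            # generate a response (hypothetical client)
        correct += int(response.strip() == row["answer"])   # exact-match scoring (illustrative)
    return correct / len(data)                              # accuracy in [0, 1]; higher is better
```
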
4. Go to the **Model leaderboards** section of the model catalog. This section displays the top three model leaders ranked along [quality](../concepts/model-benchmarks.md#quality-benchmarks-of-language-models), [cost](../concepts/model-benchmarks.md#cost-benchmarks-of-language-models), and [performance](../concepts/model-benchmarks.md#performance-benchmarks-of-language-models). You can select any of these models to check out more details.

:::image type="content" source="../media/how-to/model-benchmarks/leaderboard-entry-select-model.png" alt-text="Screenshot showing the selected model from entry point of leaderboards on the model catalog homepage." lightbox="../media/how-to/model-benchmarks/leaderboard-entry-select-model.png":::