
Commit 4988a39

Merge pull request #4610 from changliu2/leaderboard-build-2025
leaderboard doc updates for //Build
2 parents de44f00 + 528e879 commit 4988a39

2 files changed: +20 -16 lines changed

articles/ai-foundry/concepts/model-benchmarks.md

Lines changed: 19 additions & 15 deletions
@@ -20,18 +20,17 @@ author: lgayhardt
Model leaderboards (preview) in Azure AI Foundry portal allow you to streamline the model selection process in the Azure AI Foundry [model catalog](../how-to/model-catalog-overview.md). The model leaderboards, backed by industry-standard benchmarks, can help you to find the best model for your custom AI solution. From the model leaderboards section of the model catalog, you can [browse leaderboards](https://aka.ms/model-leaderboards) to compare available models as follows:

- - **Quality, cost, and performance leaderboards** to quickly identify the model leaders along a single metric (quality, cost, or throughput);
- - **Trade-off charts** to see how models perform on one metric versus another, such as quality versus cost;
- - **Leaderboards by scenario** to find the best leaderboards that suite your scenario.
+ - [Quality, cost, and performance leaderboards](../how-to/benchmark-model-in-catalog.md#access-model-leaderboards) to quickly identify the model leaders along a single metric (quality, cost, or throughput);
+ - [Trade-off charts](../how-to/benchmark-model-in-catalog.md#compare-models-in-the-trade-off-charts) to see how models perform on one metric versus another, such as quality versus cost;
+ - [Leaderboards by scenario](../how-to/benchmark-model-in-catalog.md#view-leaderboards-by-scenario) to find the best leaderboards that suit your scenario.

Whenever you find a model to your liking, you can select it and zoom into the **Detailed benchmarking results** of the model within the model catalog. If satisfied with the model, you can deploy it, try it in the playground, or evaluate it on your data. The leaderboards support benchmarking across text language models (large language models (LLMs) and small language models (SLMs)) and embedding models.

- ## Benchmarking of large and small language models
+ Model benchmarks assess LLMs and SLMs across the following categories: quality, performance, and cost. In addition, we assess the quality of embedding models using standard benchmarks. The benchmarks are updated regularly as better and less saturated datasets and associated metrics are added to existing models, and as new models are added to the model catalog.

- Model benchmarks assess LLMs and SLMs across the following categories: quality, performance, and cost. The benchmarks are updated regularly as new datasets and associated metrics are added to existing models, and as new models are added to the model catalog.

- ### Quality
+ ## Quality benchmarks of language models

Azure AI assesses the quality of LLMs and SLMs using accuracy scores from standard, comprehensive benchmark datasets measuring model capabilities such as reasoning, knowledge, question answering, math, and coding.
@@ -67,7 +66,14 @@ See more details in accuracy scores:
Accuracy scores are provided on a scale of zero to one. Higher values are better.

- ### Performance
+ ## Safety benchmarks of language models
+
+ Safety benchmarks use a standard metric, Attack Success Rate, to measure how vulnerable language models are to attacks in biosecurity, cybersecurity, and chemical security. Currently, the [Weapons of Mass Destruction Proxy (WMDP) benchmark](https://www.wmdp.ai/) is used to assess hazardous knowledge in language models. The lower the Attack Success Rate, the safer the model response.
+
+ All model endpoints are benchmarked with the default Azure AI Content Safety filters turned on, with a default configuration. These safety filters detect and block [content harm categories](../../ai-services/content-safety/concepts/harm-categories.md) in violence, self-harm, sexual, and hate and unfairness, but don't measure categories in cybersecurity, biosecurity, or chemical security.

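To make the metric concrete, here is a minimal sketch of how an attack success rate could be tallied over a set of adversarial prompts. The `generate` and `is_successful_attack` callables are hypothetical stand-ins for the model endpoint and the benchmark's grading logic, not the actual WMDP implementation.

```python
from typing import Callable, Iterable


def attack_success_rate(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    is_successful_attack: Callable[[str, str], bool],
) -> float:
    """Fraction of adversarial prompts that elicit a harmful response.

    `generate` calls the model endpoint; `is_successful_attack` is a
    hypothetical judge that decides whether the response reveals the
    hazardous knowledge the prompt was probing for. Lower is safer.
    """
    prompts = list(prompts)
    successes = sum(1 for p in prompts if is_successful_attack(p, generate(p)))
    return successes / len(prompts) if prompts else 0.0
```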
+ ## Performance benchmarks of language models

Performance metrics are calculated as an aggregate over 14 days, based on 24 trials (two requests per trial) sent daily with a one-hour interval between every trial. The following default parameters are used for each request to the model endpoint:

@@ -109,7 +115,7 @@ Azure AI also displays performance indexes for latency and throughput as follows
For performance metrics like latency or throughput, the time to first token and the generated tokens per second give a better overall sense of the typical performance and behavior of the model. We refresh our performance numbers on a regular cadence.
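For intuition, the sketch below shows how raw per-request measurements could be rolled up into time-to-first-token and tokens-per-second aggregates. The field names and the use of a simple mean are assumptions for illustration, not the service's actual aggregation pipeline.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestMeasurement:
    # One request within a trial; fields are illustrative.
    time_to_first_token_s: float  # seconds until the first token arrives
    total_latency_s: float        # seconds until the last token arrives
    generated_tokens: int         # tokens produced by the model


def aggregate_performance(measurements: list[RequestMeasurement]) -> dict[str, float]:
    """Aggregate raw measurements (for example, 14 days x 24 trials x 2 requests)."""
    return {
        "mean_time_to_first_token_s": mean(
            m.time_to_first_token_s for m in measurements
        ),
        "mean_tokens_per_second": mean(
            m.generated_tokens / m.total_latency_s for m in measurements
        ),
    }


# Example: two requests from a single trial.
sample = [RequestMeasurement(0.42, 3.1, 210), RequestMeasurement(0.55, 2.8, 180)]
print(aggregate_performance(sample))
```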

- ### Cost
+ ## Cost benchmarks of language models

Cost calculations are estimates for using an LLM or SLM model endpoint hosted on the Azure AI platform. Azure AI supports displaying the cost of serverless APIs and Azure OpenAI models. Because these costs are subject to change, we refresh our cost calculations on a regular cadence.
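As a rough illustration of how such an estimate can be built from token counts and list prices (the prices in the example are placeholders, not Azure rates):

```python
def estimated_cost_usd(
    input_tokens: int,
    output_tokens: int,
    price_per_1k_input_usd: float,
    price_per_1k_output_usd: float,
) -> float:
    """Estimate the cost of one request from token counts and per-1K-token prices."""
    return (
        input_tokens / 1000 * price_per_1k_input_usd
        + output_tokens / 1000 * price_per_1k_output_usd
    )


# Hypothetical prices, for illustration only.
print(
    estimated_cost_usd(
        input_tokens=1200,
        output_tokens=350,
        price_per_1k_input_usd=0.0005,
        price_per_1k_output_usd=0.0015,
    )
)
```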

@@ -127,13 +133,11 @@ Azure AI also displays the cost index as follows:
|-------|-------------|
| Cost index | Estimated cost. Lower values are better. |

- ## Benchmarking of embedding models
-
- Model benchmarks assess embedding models based on quality.
+ ## Quality benchmarks of embedding models

- ### Quality
+ The quality index of embedding models is defined as the averaged accuracy scores of a comprehensive set of standard benchmark datasets targeting Information Retrieval, Document Clustering, and Summarization tasks.

- The quality of embedding models is assessed across the following metrics:
+ See more details in accuracy score definitions specific to each dataset:

| Metric | Description |
|--------|-------------|
@@ -145,9 +149,9 @@ The quality of embedding models is assessed across the following metrics:
| Spearman correlation | Spearman correlation based on cosine similarity is calculated by first computing the cosine similarity between variables, then ranking these scores and using the ranks to compute the Spearman correlation. |
| V measure | V measure is a metric used to evaluate the quality of clustering. V measure is calculated as a harmonic mean of homogeneity and completeness, ensuring a balance between the two for a meaningful score. Possible scores lie between zero and one, with one being perfectly complete labeling. |
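To illustrate two of these metrics, the snippet below computes a Spearman correlation from cosine similarities and a V measure for a clustering using common open-source libraries. The random embeddings and labels are placeholders, and the benchmark's own evaluation code may differ.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import v_measure_score
from sklearn.metrics.pairwise import cosine_similarity

# Spearman correlation based on cosine similarity: compare the ranked cosine
# similarities of embedding pairs against gold similarity ratings.
emb_a = np.random.rand(8, 384)    # embeddings of the first sentences (placeholder)
emb_b = np.random.rand(8, 384)    # embeddings of the paired sentences (placeholder)
cosine_scores = np.diag(cosine_similarity(emb_a, emb_b))
human_scores = np.random.rand(8)  # gold similarity ratings (placeholder)
spearman_corr, _ = spearmanr(cosine_scores, human_scores)

# V measure: harmonic mean of homogeneity and completeness for a clustering.
true_labels = [0, 0, 1, 1, 2, 2]
predicted_clusters = [0, 0, 1, 2, 2, 2]
v = v_measure_score(true_labels, predicted_clusters)

print(f"Spearman correlation: {spearman_corr:.3f}, V measure: {v:.3f}")
```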

- ### Calculation of scores
+ ## Calculation of scores

- #### Individual scores
+ ### Individual scores

Benchmark results originate from public datasets that are commonly used for language model evaluation. In most cases, the data is hosted in GitHub repositories maintained by the creators or curators of the data. Azure AI evaluation pipelines download data from their original sources, extract prompts from each example row, generate model responses, and then compute relevant accuracy metrics.
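Conceptually, that flow can be sketched as follows; the row fields, prompt construction, model call, and exact-match grading are illustrative stand-ins rather than the actual Azure AI evaluation pipeline.

```python
from typing import Callable


def evaluate_accuracy(
    rows: list[dict],
    build_prompt: Callable[[dict], str],
    generate: Callable[[str], str],
    is_correct: Callable[[str, dict], bool],
) -> float:
    """Score a model on a benchmark dataset: prompt, generate, then grade."""
    correct = 0
    for row in rows:                      # one example per dataset row
        prompt = build_prompt(row)        # extract/format the prompt
        response = generate(prompt)       # call the model endpoint
        if is_correct(response, row):     # compare against the reference answer
            correct += 1
    return correct / len(rows) if rows else 0.0


# Illustrative use with an exact-match grader on a tiny fake dataset.
rows = [{"question": "2 + 2 = ?", "answer": "4"}]
score = evaluate_accuracy(
    rows,
    build_prompt=lambda r: r["question"],
    generate=lambda p: "4",  # stand-in for a model call
    is_correct=lambda resp, r: resp.strip() == r["answer"],
)
print(score)  # 1.0
```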

articles/ai-foundry/how-to/benchmark-model-in-catalog.md

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ In this article, you learn to streamline your model selection process in the Azu
[!INCLUDE [open-catalog](../includes/open-catalog.md)]

- 4. Go to the **Model leaderboards** section of the model catalog. This section displays the top three model leaders ranked along [quality](../concepts/model-benchmarks.md#quality), [cost](../concepts/model-benchmarks.md#cost), and [performance](../concepts/model-benchmarks.md#performance). You can select any of these models to check out more details.
+ 4. Go to the **Model leaderboards** section of the model catalog. This section displays the top three model leaders ranked along [quality](../concepts/model-benchmarks.md#quality-benchmarks-of-language-models), [cost](../concepts/model-benchmarks.md#cost-benchmarks-of-language-models), and [performance](../concepts/model-benchmarks.md#performance-benchmarks-of-language-models). You can select any of these models to check out more details.

:::image type="content" source="../media/how-to/model-benchmarks/leaderboard-entry-select-model.png" alt-text="Screenshot showing the selected model from entry point of leaderboards on the model catalog homepage." lightbox="../media/how-to/model-benchmarks/leaderboard-entry-select-model.png":::