
Commit c0ddab8

Merge pull request #274418 from jesscioffi/main
Creating new model benchmarks documentation with images
2 parents 96e2c59 + 53eaa44 commit c0ddab8

5 files changed

+68
-0
lines changed

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
---
2+
title: Explore model benchmarks in Azure AI Studio
3+
titleSuffix: Azure AI Studio
4+
description: This article introduces benchmarking capabilities and the model benchmarks experience in Azure AI Studio.
5+
manager: scottpolly
6+
ms.service: azure-ai-studio
7+
ms.custom:
8+
ms.topic: how-to
9+
ms.date: 5/6/2024
10+
ms.reviewer: jcioffi
11+
ms.author: jcioffi
12+
author: jesscioffi
13+
---
14+
15+
# Model benchmarks
16+
17+
[!INCLUDE [Azure AI Studio preview](../includes/preview-ai-studio.md)]
18+
19+
In Azure AI Studio, you can compare benchmarks across the models and datasets available in the industry to assess which model best fits your business scenario. You can find **Model benchmarks** under **Get started** in the left menu of Azure AI Studio.
20+
21+
:::image type="content" source="../media/explore/model-benchmarks-dashboard-view.png" alt-text="Screenshot of dashboard view graph of model accuracy." lightbox="../media/explore/model-benchmarks-dashboard-view.png":::
22+
23+
Model benchmarks help you make informed decisions about the suitability of models and datasets before you start any job. The benchmarks are a curated list of the best-performing models for a given task, based on a comprehensive comparison of benchmarking metrics. Currently, Azure AI Studio provides benchmarks based on quality, using the metrics listed in the following table.
24+
25+
| Metric | Description |
26+
|--------------|-------|
27+
| Accuracy |Accuracy scores are available at the dataset and the model levels. At the dataset level, the score is the average value of an accuracy metric computed over all examples in the dataset. The accuracy metric used is exact match in all cases except for the *HumanEval* dataset, which uses a `pass@1` metric. Exact match compares the model-generated text with the correct answer according to the dataset, reporting one if the generated text matches the answer exactly and zero otherwise. `pass@1` measures the proportion of model solutions that pass a set of unit tests in a code generation task. At the model level, the accuracy score is the average of the dataset-level accuracies for each model.|
28+
| Coherence |Coherence evaluates how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.|
29+
| Fluency |Fluency evaluates the language proficiency of a generative AI's predicted answer. It assesses how well the generated text adheres to grammatical rules, syntactic structures, and appropriate usage of vocabulary, resulting in linguistically correct and natural-sounding responses.|
30+
| GPTSimilarity|GPTSimilarity is a measure that quantifies the similarity between a ground truth sentence (or document) and the prediction sentence generated by an AI model. It is calculated by first computing sentence-level embeddings using the embeddings API for both the ground truth and the model's prediction. These embeddings are high-dimensional vector representations of the sentences that capture their semantic meaning and context.|
31+
32+
The benchmarks are updated regularly as new metrics and datasets are added to existing models, and as new models are added to the model catalog.
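
The accuracy scoring described in the metrics table can be illustrated with a minimal Python sketch. This is not the Azure AI evaluation pipeline; the function names (`exact_match`, `dataset_accuracy`, `pass_at_1`, `model_accuracy`) and the sample data are hypothetical and only mirror the definitions given in the table.

```python
from statistics import mean


def exact_match(prediction: str, reference: str) -> int:
    """Return 1 if the generated text matches the reference exactly, else 0."""
    return int(prediction == reference)


def dataset_accuracy(predictions: list[str], references: list[str]) -> float:
    """Dataset-level score: average exact match over all examples."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    return mean(exact_match(p, r) for p, r in zip(predictions, references))


def pass_at_1(passed: list[bool]) -> float:
    """Proportion of generated solutions that pass their unit tests.

    Assumes one generated solution per problem, with `passed` recording
    whether each solution passed.
    """
    return mean(int(flag) for flag in passed)


def model_accuracy(dataset_scores: dict[str, float]) -> float:
    """Model-level score: average of the dataset-level accuracies."""
    return mean(dataset_scores.values())


# Hypothetical results for one model on two datasets.
scores = {
    "qa_dataset": dataset_accuracy(["Paris", "4", "blue"], ["Paris", "5", "blue"]),
    "code_dataset": pass_at_1([True, False, True, True]),
}
print(scores)
print(model_accuracy(scores))
```

Because the model-level score is an unweighted average of dataset-level scores, a small dataset influences it as much as a large one under this reading of the table.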
33+
34+
### How the scores are calculated
35+
36+
The benchmark results originate from public datasets that are commonly used for language model evaluation. In most cases, the data is hosted in GitHub repositories maintained by the creators or curators of the data. Azure AI evaluation pipelines download data from their original sources, extract prompts from each example row, generate model responses, and then compute relevant accuracy metrics.
37+
38+
Prompt construction follows the best practices for each dataset, as defined by the paper that introduced the dataset and by industry standards. In most cases, each prompt contains several examples of complete questions and answers, or "shots," to prime the model for the task. The evaluation pipelines create shots by sampling questions and answers from a portion of the data that is held out from evaluation.
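
As a rough illustration of the few-shot approach described above, the following Python sketch holds out part of a dataset as a shot pool and samples shots from it to build a prompt. The helper names (`split_shot_pool`, `build_prompt`), the `Question:`/`Answer:` template, and the sample data are assumptions made for this example; the actual prompt format depends on each dataset.

```python
import random


def split_shot_pool(examples: list[dict], pool_size: int, seed: int = 0):
    """Hold out `pool_size` examples as the shot pool; evaluate on the rest."""
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    return shuffled[:pool_size], shuffled[pool_size:]


def build_prompt(shot_pool: list[dict], question: str, num_shots: int,
                 rng: random.Random) -> str:
    """Prepend `num_shots` sampled question-answer pairs to the target question."""
    shots = rng.sample(shot_pool, num_shots)
    blocks = [f"Question: {s['question']}\nAnswer: {s['answer']}" for s in shots]
    # The final block leaves the answer empty for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical dataset of arithmetic questions.
examples = [{"question": f"What is {i} + {i}?", "answer": str(2 * i)}
            for i in range(1, 9)]
shot_pool, eval_set = split_shot_pool(examples, pool_size=3)
print(build_prompt(shot_pool, eval_set[0]["question"], num_shots=2,
                   rng=random.Random(0)))
```

Keeping the shot pool disjoint from the evaluation set means that no evaluated question appears in a prompt together with its answer.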
39+
40+
### View options in the model benchmarks
41+
42+
Model benchmarks offer both a dashboard view and a list view of the data for ease of comparison, along with information that explains what the calculated metrics mean.
43+
44+
Dashboard view allows you to compare the scores of multiple models across datasets and tasks. You can view models side by side (horizontally along the x-axis) and compare their scores (vertically along the y-axis) for each metric.
45+
46+
You can filter the dashboard view by task, model collection, model name, dataset, and metric.
47+
48+
You can switch from dashboard view to list view by following these steps:
49+
1. Select the models you want to compare.
50+
2. Select **List** on the right side of the page.
51+
52+
:::image type="content" source="../media/explore/model-benchmarks-dashboard-filtered.png" alt-text="Screenshot of dashboard view graph with question answering filter applied and 'List' button identified." lightbox="../media/explore/model-benchmarks-dashboard-filtered.png":::
53+
54+
In list view you can find the following information:
55+
- Model name, description, version, and aggregate scores.
56+
- Benchmark datasets (such as AGIEval) and tasks (such as question answering) that were used to evaluate the model.
57+
- Model scores per dataset.
58+
59+
You can also filter the list view by task, model collection, model name, dataset, and metric.
60+
61+
:::image type="content" source="../media/explore/model-benchmarks-list-view.png" alt-text="Screenshot of list view table displaying accuracy metrics in an ordered list." lightbox="../media/explore/model-benchmarks-list-view.png":::
62+
63+
## Next steps
64+
65+
- [Explore Azure AI foundation models in Azure AI Studio](models-foundation-azure-ai.md)
66+
- [View and compare benchmarks in AI Studio](https://ai.azure.com/explore/benchmarks)

articles/ai-studio/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -41,6 +41,8 @@ items:
4141
items:
4242
- name: Model catalog
4343
href: how-to/model-catalog.md
44+
- name: Model benchmarks
45+
href: how-to/model-benchmarks.md
4446
- name: Cohere models
4547
items:
4648
- name: Deploy Cohere Command models
