---
title: Explore model benchmarks in Azure AI Studio
titleSuffix: Azure AI Studio
description: This article introduces benchmarking capabilities and the model benchmarks experience in Azure AI Studio.
manager: scottpolly
ms.service: azure-ai-studio
ms.custom:
ms.topic: how-to
ms.date: 5/6/2024
ms.reviewer: jcioffi
ms.author: jcioffi
author: jesscioffi
---

# Model benchmarks

[!INCLUDE [Azure AI Studio preview](../includes/preview-ai-studio.md)]

In Azure AI Studio, you can compare benchmarks across models and datasets available in the industry to assess which one best meets your business scenario. You can find **Model benchmarks** under **Get started** in the left menu of Azure AI Studio.

:::image type="content" source="../media/explore/model-benchmarks-dashboard-view.png" alt-text="Screenshot of dashboard view graph of model accuracy." lightbox="../media/explore/model-benchmarks-dashboard-view.png":::

Model benchmarks help you make informed decisions about the suitability of models and datasets prior to initiating any job. The benchmarks are a curated list of the best-performing models for a given task, based on a comprehensive comparison of benchmarking metrics. Currently, Azure AI Studio provides benchmarks based on quality, via the following metrics.

| Metric | Description |
|--------|-------------|
| Accuracy | Accuracy scores are available at the dataset and the model levels. At the dataset level, the score is the average value of an accuracy metric computed over all examples in the dataset. The accuracy metric used is exact-match in all cases, except for the *HumanEval* dataset, which uses a `pass@1` metric. Exact match compares model-generated text with the correct answer according to the dataset, reporting one if the generated text matches the answer exactly and zero otherwise. `Pass@1` measures the proportion of model solutions that pass a set of unit tests in a code generation task. At the model level, the accuracy score is the average of the dataset-level accuracies for each model. |
| Coherence | Coherence evaluates how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language. |
| Fluency | Fluency evaluates the language proficiency of a generative AI's predicted answer. It assesses how well the generated text adheres to grammatical rules, syntactic structures, and appropriate usage of vocabulary, resulting in linguistically correct and natural-sounding responses. |
| GPTSimilarity | GPTSimilarity is a measure that quantifies the similarity between a ground truth sentence (or document) and the prediction sentence generated by an AI model. It's calculated by first computing sentence-level embeddings, using the embeddings API, for both the ground truth and the model's prediction. These embeddings represent high-dimensional vector representations of the sentences, capturing their semantic meaning and context. |
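
To make these definitions more concrete, the following Python sketch shows how dataset-level exact match, `pass@1`, and an embedding-based similarity score could be computed in principle. It isn't the Azure AI evaluation pipeline; the function names and input shapes are assumptions for illustration only.

```python
from typing import Iterable, Sequence

def exact_match_accuracy(predictions: Iterable[str], references: Iterable[str]) -> float:
    """Dataset-level accuracy: the share of generations that match the reference exactly."""
    scores = [1.0 if pred.strip() == ref.strip() else 0.0
              for pred, ref in zip(predictions, references)]
    return sum(scores) / len(scores)

def pass_at_1(unit_test_passed: Iterable[bool]) -> float:
    """pass@1: the proportion of generated code solutions that pass their unit tests."""
    results = list(unit_test_passed)
    return sum(results) / len(results)

def cosine_similarity(embedding_a: Sequence[float], embedding_b: Sequence[float]) -> float:
    """Cosine similarity between two sentence embeddings, the core of an embedding-based
    score such as GPTSimilarity (the embeddings would come from an embeddings API)."""
    dot = sum(a * b for a, b in zip(embedding_a, embedding_b))
    norm_a = sum(a * a for a in embedding_a) ** 0.5
    norm_b = sum(b * b for b in embedding_b) ** 0.5
    return dot / (norm_a * norm_b)
```

A model-level accuracy score would then be the simple average of these dataset-level values across the datasets evaluated for that model.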

The benchmarks are updated regularly as new metrics and datasets are added to existing models, and as new models are added to the model catalog.

## How the scores are calculated

The benchmark results originate from public datasets that are commonly used for language model evaluation. In most cases, the data is hosted in GitHub repositories maintained by the creators or curators of the data. Azure AI evaluation pipelines download data from their original sources, extract prompts from each example row, generate model responses, and then compute relevant accuracy metrics.
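
As a rough sketch only (not the actual Azure AI evaluation pipeline), the per-dataset flow could be expressed as a generic loop like the following; the callables passed in are placeholders for prompt construction, model generation, and scoring, and the `answer` field name is an assumption.

```python
from typing import Callable, Iterable, Mapping

def evaluate_dataset(
    rows: Iterable[Mapping[str, str]],                 # example rows from the downloaded dataset
    build_prompt: Callable[[Mapping[str, str]], str],  # constructs the prompt for a row
    generate: Callable[[str], str],                    # calls the model and returns its response
    score: Callable[[str, str], float],                # metric such as exact match
) -> float:
    """Illustrative per-dataset evaluation: prompt, generate, score, then average."""
    scores = []
    for row in rows:
        prompt = build_prompt(row)
        response = generate(prompt)
        scores.append(score(response, row["answer"]))  # 'answer' field name is assumed
    return sum(scores) / len(scores)
```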

Prompt construction follows best practices for each dataset, as set forth by the paper that introduces the dataset and by industry standards. In most cases, each prompt contains several examples of complete questions and answers, or "shots," to prime the model for the task. The evaluation pipelines create shots by sampling questions and answers from a portion of the data that's held out from evaluation.
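
As an illustration of this few-shot approach (prompt templates differ per dataset, and the `question`/`answer` field names here are assumptions), shot construction might look like this minimal sketch:

```python
import random
from typing import Mapping, Sequence

def build_few_shot_prompt(
    test_question: str,
    held_out_examples: Sequence[Mapping[str, str]],
    n_shots: int = 5,
    seed: int = 0,
) -> str:
    """Sample 'shots' from held-out data and prepend them to the test question."""
    rng = random.Random(seed)
    shots = rng.sample(list(held_out_examples), k=min(n_shots, len(held_out_examples)))
    parts = [f"Question: {shot['question']}\nAnswer: {shot['answer']}\n" for shot in shots]
    parts.append(f"Question: {test_question}\nAnswer:")
    return "\n".join(parts)
```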

## View options in the model benchmarks

The benchmarks include both a dashboard view and a list view of the data for ease of comparison, along with helpful information that explains what the calculated metrics mean.

The dashboard view lets you compare the scores of multiple models across datasets and tasks. You can view models side by side (horizontally along the x-axis) and compare their scores (vertically along the y-axis) for each metric.

You can filter the dashboard view by task, model collection, model name, dataset, and metric.

You can switch from dashboard view to list view by following these quick steps:
1. Select the models you want to compare.
2. Select **List** on the right side of the page.

:::image type="content" source="../media/explore/model-benchmarks-dashboard-filtered.png" alt-text="Screenshot of dashboard view graph with question answering filter applied and 'List' button identified." lightbox="../media/explore/model-benchmarks-dashboard-filtered.png":::

In list view, you can find the following information:
- Model name, description, version, and aggregate scores.
- Benchmark datasets (such as AGIEval) and tasks (such as question answering) that were used to evaluate the model.
- Model scores per dataset.

You can also filter the list view by task, model collection, model name, dataset, and metric.

:::image type="content" source="../media/explore/model-benchmarks-list-view.png" alt-text="Screenshot of list view table displaying accuracy metrics in an ordered list." lightbox="../media/explore/model-benchmarks-list-view.png":::

## Next steps

- [Explore Azure AI foundation models in Azure AI Studio](models-foundation-azure-ai.md)
- [View and compare benchmarks in AI Studio](https://ai.azure.com/explore/benchmarks)