
Commit 9086c39

Merge pull request #275664 from s-polly/stp-ai-studio-follow
Update to model benchmarks for Build
2 parents fb99fda + 7a6b521 commit 9086c39

4 files changed: +30 -3 lines changed

articles/ai-studio/how-to/model-benchmarks.md

Lines changed: 30 additions & 3 deletions
@@ -20,22 +20,49 @@ In Azure AI Studio, you can compare benchmarks across models and datasets availa
:::image type="content" source="../media/explore/model-benchmarks-dashboard-view.png" alt-text="Screenshot of dashboard view graph of model accuracy." lightbox="../media/explore/model-benchmarks-dashboard-view.png":::

-Model benchmarks help you make informed decisions about the sustainability of models and datasets prior to initiating any job. The benchmarks are a curated list of the best performing models for a given task, based on a comprehensive comparison of benchmarking metrics. Currently, Azure AI Studio provides benchmarks based on quality, via the metrics listed below.
+Model benchmarks help you make informed decisions about the sustainability of models and datasets before initiating any job. The benchmarks are a curated list of the best performing models for a given task, based on a comprehensive comparison of benchmarking metrics. Currently, Azure AI Studio provides benchmarking across the following types of models based on our model catalog collections:
+
+- Benchmarks across LLMs and SLMs
+- Benchmarks across embeddings models
+
+You can switch between the **Quality benchmarks** and **Embeddings benchmarks** views by clicking on the corresponding tabs within the model benchmarks experience in AI Studio.
+
+### Benchmarking of LLMs and SLMs
+
+Model benchmarks assess the quality of LLMs and SLMs across various metrics listed below:

| Metric | Description |
|--------------|-------|
| Accuracy |Accuracy scores are available at the dataset and the model levels. At the dataset level, the score is the average value of an accuracy metric computed over all examples in the dataset. The accuracy metric used is exact-match in all cases except for the *HumanEval* dataset that uses a `pass@1` metric. Exact match simply compares model generated text with the correct answer according to the dataset, reporting one if the generated text matches the answer exactly and zero otherwise. `Pass@1` measures the proportion of model solutions that pass a set of unit tests in a code generation task. At the model level, the accuracy score is the average of the dataset-level accuracies for each model.|
| Coherence |Coherence evaluates how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.|
| Fluency |Fluency evaluates the language proficiency of a generative AI's predicted answer. It assesses how well the generated text adheres to grammatical rules, syntactic structures, and appropriate usage of vocabulary, resulting in linguistically correct and natural-sounding responses.|
-| GPTSimilarity|GPTSimilarity is a measure that quantifies the similarity between a ground truth sentence (or document) and the prediction sentence generated by an AI model. It is calculated by first computing sentence-level embeddings using the embeddings API for both the ground truth and the model's prediction. These embeddings represent high-dimensional vector representations of the sentences, capturing their semantic meaning and context.|
+| GPTSimilarity|GPTSimilarity is a measure that quantifies the similarity between a ground truth sentence (or document) and the prediction sentence generated by an AI model. It's calculated by first computing sentence-level embeddings using the embeddings API for both the ground truth and the model's prediction. These embeddings represent high-dimensional vector representations of the sentences, capturing their semantic meaning and context.|
+|Groundedness| Groundedness measures how well the language model's generated answers align with information from the input source. |
+|Relevance| Relevance measures the extent to which the language model's generated responses are pertinent and directly related to the given questions. |
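The Accuracy row in this table refers to two different scoring rules: exact match and `pass@1`. As a rough, illustrative sketch only (not the Azure AI evaluation pipeline itself), dataset-level and model-level accuracy could be computed along these lines in Python; the `passes_unit_tests` helper is a hypothetical stand-in for a code-generation test harness.

```python
from typing import Callable, Sequence

def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Dataset-level exact match: 1 if the generated text equals the reference answer, else 0, averaged."""
    scores = [1.0 if pred.strip() == ref.strip() else 0.0 for pred, ref in zip(predictions, references)]
    return sum(scores) / len(scores)

def pass_at_1(solutions: Sequence[str], passes_unit_tests: Callable[[str], bool]) -> float:
    """pass@1 with one sample per task: the fraction of generated solutions that pass their unit tests."""
    results = [1.0 if passes_unit_tests(code) else 0.0 for code in solutions]
    return sum(results) / len(results)

def model_level_accuracy(dataset_scores: Sequence[float]) -> float:
    """Model-level accuracy: the average of the dataset-level accuracy scores."""
    return sum(dataset_scores) / len(dataset_scores)
```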

The benchmarks are updated regularly as new metrics and datasets are added to existing models, and as new models are added to the model catalog.

+### Benchmarking of embedding models
+
+Model benchmarks assess embeddings models across various metrics listed in the table:
+
+|Metric |Description |
+|---------|---------|
+|Accuracy | Accuracy is the proportion of correct predictions among the total number of predictions processed. |
+|F1 Score | F1 Score is the weighted mean of the precision and recall, where the best value is 1 (perfect precision and recall) and the worst is 0. |
+|Mean Average Precision (MAP) |Mean Average Precision (MAP) evaluates the quality of ranking and recommender systems. It measures both the relevance of suggested items and how good the system is at placing more relevant items at the top. Values can range from 0 to 1. The higher the MAP, the better the system can place relevant items high in the list. |
+|Normalized Discounted Cumulative Gain (NDCG) |Normalized Discounted Cumulative Gain (NDCG) evaluates a machine learning algorithm's ability to sort items based on relevance. It compares the ranking to an ideal order in which all relevant items are at the top of the list, where k is the length of the list evaluated for ranking quality. In our benchmarks, k=10, indicated by the metric `ndcg_at_10`, meaning that we look at the top 10 items. |
+|Precision |Precision measures the model's ability to identify instances of a particular class correctly. Precision shows how often an ML model is correct when predicting the target class. |
+|Spearman Correlation | Spearman Correlation is based on cosine similarity. It's calculated by first computing the cosine similarity between variables, then ranking these scores and using the ranks to compute the Spearman Correlation. |
+|V-measure | V-measure is a metric used to evaluate the quality of clustering. It's calculated as a harmonic mean of homogeneity and completeness, ensuring a balance between the two for a meaningful score. Possible scores lie between 0 and 1, with 1 being perfectly complete labeling. |
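As a concrete illustration of the `ndcg_at_10` metric described in this table, the following is a minimal sketch of NDCG@k using the standard discounted-cumulative-gain formulation; it's a generic textbook version under that assumption, not the exact code the benchmark pipeline runs.

```python
import math
from typing import Sequence

def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k: DCG of the system's ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: relevance grades of retrieved items, in the order the model ranked them.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=10))  # 1.0 would mean a perfect ordering
```

Passing k=10 mirrors the ndcg_at_10 setting: only the first 10 ranked items contribute to the score.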
### How the scores are calculated

The benchmark results originate from public datasets that are commonly used for language model evaluation. In most cases, the data is hosted in GitHub repositories maintained by the creators or curators of the data. Azure AI evaluation pipelines download data from their original sources, extract prompts from each example row, generate model responses, and then compute relevant accuracy metrics.
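At a high level, that flow can be sketched as a simple evaluation loop; the sketch below is illustrative only, and the `generate` and `metric` callables plus the row fields are assumptions standing in for the model endpoint, prompts, and metrics a given dataset actually uses, not the Azure AI evaluation pipeline.

```python
from typing import Callable, Dict, List

def evaluate_dataset(
    rows: List[Dict[str, str]],           # each row holds at least a "prompt" and a "reference" answer
    generate: Callable[[str], str],       # calls the model under evaluation and returns its response text
    metric: Callable[[str, str], float],  # scores one (response, reference) pair, for example exact match
) -> float:
    """Run the model over every example in a dataset and average the per-example scores."""
    scores = []
    for row in rows:
        response = generate(row["prompt"])
        scores.append(metric(response, row["reference"]))
    return sum(scores) / len(scores) if scores else 0.0
```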

-Prompt construction follows best practice for each dataset, set forth by the paper introducing the dataset and industry standard. In most cases, each prompt contains several examples of complete questions and answers, or "shots," to prime the model for the task. The evaluation pipelines create shots by sampling questions and answers from a portion of the data that is held out from evaluation.
+Prompt construction follows best practices for each dataset, defined by the paper that introduced the dataset and by industry standards. In most cases, each prompt contains several examples of complete questions and answers, or "shots," to prime the model for the task. The evaluation pipelines create shots by sampling questions and answers from a portion of the data that is held out from evaluation.
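To make the idea of shots concrete, a few-shot prompt might be assembled roughly as follows; the field names, template, and sampling scheme here are assumptions for the sketch, not the per-dataset prompt templates the pipelines actually use.

```python
import random
from typing import Dict, List

def build_few_shot_prompt(
    question: str,
    held_out_examples: List[Dict[str, str]],  # examples reserved for shots, never scored in evaluation
    n_shots: int = 5,
    seed: int = 0,
) -> str:
    """Prefix the evaluation question with sampled question/answer 'shots' to prime the model."""
    rng = random.Random(seed)
    shots = rng.sample(held_out_examples, k=min(n_shots, len(held_out_examples)))
    lines = []
    for shot in shots:
        lines.append(f"Question: {shot['question']}\nAnswer: {shot['answer']}\n")
    lines.append(f"Question: {question}\nAnswer:")
    return "\n".join(lines)
```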

### View options in the model benchmarks

3 image files changed: -20.6 KB, 15.6 KB, -40.1 KB
