
Commit 391bfc6

Merge pull request #1521 from msakande/ignite-benchmarking-updates
Ignite benchmarking updates
2 parents 0166710 + 6589024 commit 391bfc6


articles/ai-studio/concepts/model-benchmarks.md

Lines changed: 23 additions & 48 deletions
@@ -7,7 +7,7 @@ ms.service: azure-ai-studio
 ms.custom:
 - ai-learning-hub
 ms.topic: concept-article
-ms.date: 10/29/2024
+ms.date: 11/11/2024
 ms.reviewer: jcioffi
 ms.author: mopeakande
 author: msakande
@@ -56,50 +56,34 @@ Azure AI also displays the quality index as follows:
 
 | Index | Description |
 |-------|-------------|
-| Quality index | GPTSimilarity scaled down from zero to one, averaged with our accuracy metrics. A higher quality index value is better. |
+| Quality index | Quality index is calculated by scaling down GPTSimilarity between zero and one, followed by averaging with accuracy metrics. Higher values of quality index are better. |
 
-Azure AI assesses the quality index by using both the measurement of accuracy and GPTSimilarity as the prompt assisted metric. The stability of the GPTSimilarity metric averaging with the accuracy of the model provides an indicator of the overall quality of the model.
+The quality index represents the average score of the applicable primary metric (accuracy, rescaled GPTSimilarity) over 15 standard datasets and is provided on a scale of zero to one.
 
-### Performance
+Quality index constitutes two categories of metrics:
 
-To assess performance, Azure AI uses two approaches.
+- Accuracy (for example, exact match or `pass@k`). Ranges from zero to one.
+- Prompt-based metrics (for example, GPTSimilarity, groundedness, coherence, fluency, and relevance). Ranges from one to five.
 
-- Streaming response returns a chunk of one or more tokens.
-- Performance metrics are calculated as an aggregate.
+The stability of the quality index value provides an indicator of the overall quality of the model.
 
-#### Streaming response returns a chunk of one or more tokens
+### Performance
 
-This approach uses the following default parameters for benchmarking:
+Performance metrics are calculated as an aggregate over 14 days, based on 24 trials (two requests per trial) sent daily with a one-hour interval between every trial. The following default parameters are used for each request to the model endpoint:
 
-| Parameter | Value | Applies to |
+| Parameter | Value | Applicable for |
 |-----------|-------|----------------|
 | Region | East US/East US2 | [Serverless APIs](../how-to/model-catalog-overview.md#serverless-api-pay-per-token-billing) and [Azure OpenAI](/azure/ai-services/openai/overview) |
 | Tokens per minute (TPM) rate limit | 30k (180 RPM based on Azure OpenAI) <br> N/A (serverless APIs) | For Azure OpenAI models, selection is available for users with rate limit ranges based on deployment type (standard, global, global standard, and so on). <br> For serverless APIs, this setting is abstracted. |
-| Number of requests | 128 | Serverless APIs, Azure OpenAI |
+| Number of requests | Two requests in a trial for every hour (24 trials per day) | Serverless APIs, Azure OpenAI |
+| Number of trials/runs | 14 days with 24 trials per day for 336 runs | Serverless APIs, Azure OpenAI |
 | Prompt/Context length | Moderate length | Serverless APIs, Azure OpenAI |
 | Number of tokens processed (moderate) | 80:20 ratio for input to output tokens, that is, 800 input tokens to 200 output tokens. | Serverless APIs, Azure OpenAI |
-| Number of concurrent requests | 16 | Serverless APIs, Azure OpenAI |
+| Number of concurrent requests | One (requests are sent sequentially one after the other) | Serverless APIs, Azure OpenAI |
 | Data | Synthetic (input prompts prepared from static text) | Serverless APIs, Azure OpenAI |
-| Deployment type | Standard | Applicable only for Azure OpenAI |
-| Streaming | True | Applies to serverless APIs and Azure OpenAI. For models deployed via [managed compute](../how-to/model-catalog-overview.md#managed-compute), set max_token = 1 to replicate streaming scenario, which allows for calculating metrics like total time to first token (TTFT) for managed compute. |
-| Tokenizer | Tiktoken package (Azure OpenAI) <br> Hugging Face model ID (Serverless APIs) | Hugging Face model ID (Azure serverless APIs) |
-
-#### Performance metrics calculated as an aggregate
-
-For this approach, performance metrics are calculated as an aggregate over 14 days, based on 24 trials, with two requests per trial. These requests are sent daily with a one-hour interval between every trial. Each request to the model endpoint uses the following default parameters:
-
-| Parameter | Value | Applies to |
-|-----------|-------|----------------|
 | Region | East US/East US2 | Serverless APIs and Azure OpenAI |
-| Tokens per minute (TPM) rate limit | 30k (180 RPM based on Azure OpenAI) <br> N/A (serverless APIs) | For Azure OpenAI models, selection is available for users with rate limit ranges based on deployment type (standard, global, global standard, and so on.) <br> For serverless APIs, this setting is abstracted. |
-| Number of requests | Two requests in a trial for every hour (24 trials per day) | Serverless APIs and Azure OpenAI |
-| Number of trials/runs | 14 days * 24 trials = 336 | Serverless APIs and Azure OpenAI |
-| Prompt/Context length | Moderate length | Serverless APIs and Azure OpenAI |
-| Number of tokens processed | 80:20 ratio for input to output tokens, that is, 800 input tokens to 200 output tokens. | Serverless APIs and Azure OpenAI |
-| Number of concurrent requests | One (requests are sent sequentially one after another) | Serverless APIs and Azure OpenAI |
-| Data | Synthetic (Input prompts prepared from static text) | Serverless APIs, Azure OpenAI, and managed compute |
 | Deployment type | Standard | Applicable only for Azure OpenAI |
-| Streaming | True | Applicable for serverless APIs and Azure OpenAI. For models deployed via managed compute, set max_token = 1 to replicate streaming scenario, which allows for calculating metrics like total time to first token (TTFT) for managed compute. |
+| Streaming | True | Applies to serverless APIs and Azure OpenAI. For models deployed via [managed compute](../how-to/model-catalog-overview.md#managed-compute), set max_token = 1 to replicate streaming scenario, which allows for calculating metrics like total time to first token (TTFT) for managed compute. |
 | Tokenizer | Tiktoken package (Azure OpenAI) <br> Hugging Face model ID (Serverless APIs) | Hugging Face model ID (Azure serverless APIs) |
 
 The performance of LLMs and SLMs is assessed across the following metrics:
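
The quality-index arithmetic this change documents (rescale GPTSimilarity from its one-to-five range down to zero-to-one via min-max scaling, as the removed "Indexes" section later in this diff spells out, then average with accuracy across the 15 datasets) can be sketched as follows. This is a minimal illustration with made-up dataset names and scores, not Azure AI's pipeline code:

```python
# Illustrative sketch of the quality-index calculation described above.
# Dataset names and scores are hypothetical, not benchmark results.

def rescale_gpt_similarity(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Min-max scale a prompt-based metric from [1, 5] down to [0, 1]."""
    return (score - low) / (high - low)

def quality_index(per_dataset_scores: list[tuple[str, float]]) -> float:
    """Average the applicable primary metric (accuracy or rescaled
    GPTSimilarity, both already on a zero-to-one scale) over all datasets."""
    return sum(score for _, score in per_dataset_scores) / len(per_dataset_scores)

scores = [
    ("dataset_scored_by_accuracy", 0.82),                       # already 0-1
    ("dataset_scored_by_gptsim", rescale_gpt_similarity(4.2)),  # rescaled from 1-5
    # ... 15 standard datasets in total
]
print(f"Quality index: {quality_index(scores):.2f}")  # higher is better
```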
@@ -116,32 +100,32 @@ The performance of LLMs and SLMs is assessed across the following metrics:
 | Latency TTFT | Total time to first token (TTFT) is the time taken for the first token in the response to be returned from the endpoint when streaming is enabled. |
 | Time between tokens | This metric is the time between tokens received. |
 
-Azure AI also displays the latency and throughput indexes as follows:
+Azure AI also displays performance indexes for latency and throughput as follows:
 
 | Index | Description |
 |-------|-------------|
-| Latency index | Latency TTFT. Time to first token. Lower values are better. |
-| Throughput index | Throughput GTPS. Generated tokens per second. Higher values are better. |
+| Latency index | Mean time to first token. Lower values are better. |
+| Throughput index | Mean generated tokens per second. Higher values are better. |
 
-Performance metrics like latency and throughput are measured over time. For these indexes, Azure AI uses the TTFT and the GTPS as indexes. These measurements indicate how quick a model performs and how much data it can process.
+For performance metrics like latency or throughput, the time to first token and the generated tokens per second give a better overall sense of the typical performance and behavior of the model. We refresh our performance numbers on a regular cadence.
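
As a rough illustration of how per-request TTFT and throughput measurements feed the mean-based indexes above, here is a hedged sketch. `stream_completion` and `chunk.token_count` are hypothetical stand-ins for a streaming completions client; they are not a real Azure SDK surface:

```python
# Hypothetical sketch: measure TTFT and generated tokens per second for one
# streamed request, then average over the benchmark's 336 runs.
import time
from statistics import mean

def measure_request(stream_completion, prompt: str) -> tuple[float, float]:
    """Return (ttft_seconds, generated_tokens_per_second) for one request."""
    start = time.monotonic()
    first_token_at = None
    tokens = 0
    for chunk in stream_completion(prompt):    # each chunk carries one or more tokens
        if first_token_at is None:
            first_token_at = time.monotonic()  # time to first token (TTFT)
        tokens += chunk.token_count            # hypothetical chunk attribute
    elapsed = time.monotonic() - start
    return first_token_at - start, tokens / elapsed

# Aggregate over the schedule described above: 24 trials/day x 14 days = 336
# runs, two sequential requests per trial; the means become the indexes.
def latency_and_throughput_indexes(samples: list[tuple[float, float]]) -> tuple[float, float]:
    ttfts, gtps = zip(*samples)
    return mean(ttfts), mean(gtps)  # lower mean TTFT and higher mean GTPS are better
```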
 
 ### Cost
 
-Cost calculations are estimates for using an LLM or SLM model endpoint hosted on the Azure AI platform. Azure AI supports displaying the cost of serverless APIs and Azure OpenAI models. Because these costs are subject to change, we refresh our cost calculations two times a week.
+Cost calculations are estimates for using an LLM or SLM model endpoint hosted on the Azure AI platform. Azure AI supports displaying the cost of serverless APIs and Azure OpenAI models. Because these costs are subject to change, we refresh our cost calculations on a regular cadence.
 
 The cost of LLMs and SLMs is assessed across the following metrics:
 
 | Metric | Description |
 |--------|-------------|
-| Cost per input tokens | Price for serverless API deployment for 1 million input tokens |
-| Cost per output tokens | Price for serverless API deployment for 1 million output tokens |
-| Total price | Price for the sum of cost per input tokens and cost per output tokens, with a ratio of 3:1. |
+| Cost per input tokens | Cost for serverless API deployment for 1 million input tokens |
+| Cost per output tokens | Cost for serverless API deployment for 1 million output tokens |
+| Estimated cost | Cost for the sum of cost per input tokens and cost per output tokens, with a ratio of 3:1. |
 
 Azure AI also displays the cost index as follows:
 
 | Index | Description |
 |-------|-------------|
-| Cost index | Total price. Lower values are better. |
+| Cost index | Estimated cost. Lower values are better. |
 
 ## Benchmarking of embedding models
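
The 3:1 ratio in the estimated-cost row leaves the exact weighting unstated. One plausible reading, blending the two per-million-token prices at three parts input to one part output, looks like this (the prices in the example are made up):

```python
# A sketch of one plausible interpretation of the 3:1 input-to-output blend;
# the benchmark's exact formula is not spelled out in the article.
def estimated_cost(price_per_1m_input: float, price_per_1m_output: float) -> float:
    """Blend input and output token prices at a 3:1 input-to-output ratio."""
    return (3 * price_per_1m_input + 1 * price_per_1m_output) / 4

print(estimated_cost(0.30, 0.60))  # -> 0.375; lower values mean a better cost index
```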

@@ -169,15 +153,6 @@ Benchmark results originate from public datasets that are commonly used for lang
 
 Prompt construction follows best practices for each dataset, as specified by the paper introducing the dataset and industry standards. In most cases, each prompt contains several _shots_, that is, several examples of complete questions and answers to prime the model for the task. The evaluation pipelines create shots by sampling questions and answers from a portion of the data that's held out from evaluation.
 
-#### Indexes
-
-The different indexes are calculated as follows:
-
-- **Quality index:** This index is the average score across our accuracy measurement and our GPTSimilarity measurement (brought to a scale from zero to one, using min-max scaling). The index is provided on a scale from zero to one.
-- **Performance index:** This index is the median latency across all the values that we measure.
-- **Cost index:** This index is the total price for the cost per input tokens and cost per output tokens, with a ratio of 3:1.
-
-
 ## Related content
 
 - [How to benchmark models in Azure AI Studio](../how-to/benchmark-model-in-catalog.md)
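
For readers curious how the few-shot prompt construction retained above might look in practice, a minimal sketch follows. The field names, shot count, and sampling strategy are illustrative assumptions, not the evaluation pipeline's actual code:

```python
# Hedged sketch of few-shot prompt construction: shots are sampled from a
# held-out split and prepended to the test question, as the article describes.
import random

def build_prompt(test_question: str, held_out: list[dict], n_shots: int = 5,
                 seed: int = 0) -> str:
    rng = random.Random(seed)            # fixed seed keeps runs reproducible
    shots = rng.sample(held_out, n_shots)
    lines = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in shots]
    lines.append(f"Q: {test_question}\nA:")  # model completes the final answer
    return "\n\n".join(lines)
```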
