articles/ai-studio/concepts/model-benchmarks.md (29 additions, 26 deletions)
```diff
@@ -75,34 +75,34 @@ This approach uses the following default parameters for benchmarking:
 
 | Parameter | Value | Applies to |
 |-----------|-------|----------------|
-| Region | East US/East US2 | Serverless API model deployments and Azure OpenAI |
-| Tokens per minute (TPM) rate limit | 30k (180 RPM based on Azure OpenAI) | N/A (Azure serverless API model deployments)|
-| Number of requests | 128 | Serverless API model deployments, Azure OpenAI |
-| Prompt/Context length | Moderate length | Serverless API model deployments, Azure OpenAI |
-| Number of tokens processed (moderate) | 80:20 ratio for input to output tokens, that is, 800 input tokens to 200 output tokens. | Serverless API model deployments, Azure OpenAI |
-| Number of concurrent requests | 16 | Serverless API model deployments, Azure OpenAI |
-| Data | Synthetic (I/p prompts prepared from static text) | Serverless API model deployments, Azure OpenAI |
+| Region | East US/East US2 | Serverless APIs and Azure OpenAI |
+| Tokens per minute (TPM) rate limit | 30k (180 RPM based on Azure OpenAI) <br> N/A (serverless APIs) | For Azure OpenAI models, selection is available for users with rate limit ranges based on deployment type (standard, global, global standard, and so on). <br> For serverless APIs, this setting is abstracted. |
+| Number of requests | 128 | Serverless APIs, Azure OpenAI |
+| Prompt/Context length | Moderate length | Serverless APIs, Azure OpenAI |
+| Number of tokens processed (moderate) | 80:20 ratio for input to output tokens, that is, 800 input tokens to 200 output tokens. | Serverless APIs, Azure OpenAI |
+| Number of concurrent requests | 16 | Serverless APIs, Azure OpenAI |
+| Data | Synthetic (input prompts prepared from static text) | Serverless APIs, Azure OpenAI |
 | Deployment type | Standard | Applicable only for Azure OpenAI |
-| Streaming | True | Applies to serverless API model deployments and Azure OpenAI. For models deployed via managed compute, set max_token = 1 to replicate streaming scenario, which allows for calculating metrics like total time to first token (TTFT) for models deployed via managed compute. |
-| Tokenizer | Tiktoken package (Azure OpenAI) | Hugging Face model ID (Azure serverless API model deployments) |
+| Streaming | True | Applies to serverless APIs and Azure OpenAI. For models deployed via managed compute, set max_tokens = 1 to replicate the streaming scenario, which allows for calculating metrics like total time to first token (TTFT) for managed compute. |
+| Tokenizer | Tiktoken package (Azure OpenAI) <br> Hugging Face model ID (serverless APIs) | Serverless APIs and Azure OpenAI |
 
 #### Performance metrics calculated as an aggregate
 
 For this approach, performance metrics are calculated as an aggregate over 14 days, based on 24 trials, with two requests per trial. These requests are sent daily with a one-hour interval between every trial. Each request to the model endpoint uses the following default parameters:
 
 | Parameter | Value | Applies to |
 |-----------|-------|----------------|
-| Region | East US/East US2 | Serverless API model deployments and Azure OpenAI |
-| Tokens per minute (TPM) rate limit | 30k (180 RPM based on Azure OpenAI) | N/A (Azure serverless API model deployments)|
-| Number of requests | Two requests in a trial for every hour (24 trials per day) | Serverless API model deployments and Azure OpenAI |
-| Number of trials/runs | 14 days * 24 trials = 336 | Serverless API model deployments and Azure OpenAI |
-| Prompt/Context length | Moderate length | Serverless API model deployments and Azure OpenAI |
-| Number of tokens processed | 80:20 ratio for input to output tokens, that is, 800 input tokens to 200 output tokens. | Serverless API model deployments, Azure OpenAI |
-| Number of concurrent requests | One (requests are sent sequentially one after another) | Serverless API model deployments and Azure OpenAI |
-| Data | Synthetic (I/p prompts prepared from static text) | Serverless API model deployments, Azure OpenAI, and models deployed via managed compute |
+| Region | East US/East US2 | Serverless APIs and Azure OpenAI |
+| Tokens per minute (TPM) rate limit | 30k (180 RPM based on Azure OpenAI) <br> N/A (serverless APIs) | For Azure OpenAI models, selection is available for users with rate limit ranges based on deployment type (standard, global, global standard, and so on). <br> For serverless APIs, this setting is abstracted. |
+| Number of requests | Two requests in a trial for every hour (24 trials per day) | Serverless APIs and Azure OpenAI |
+| Number of trials/runs | 14 days * 24 trials = 336 | Serverless APIs and Azure OpenAI |
+| Prompt/Context length | Moderate length | Serverless APIs and Azure OpenAI |
+| Number of tokens processed | 80:20 ratio for input to output tokens, that is, 800 input tokens to 200 output tokens. | Serverless APIs and Azure OpenAI |
+| Number of concurrent requests | One (requests are sent sequentially one after another) | Serverless APIs and Azure OpenAI |
+| Data | Synthetic (input prompts prepared from static text) | Serverless APIs, Azure OpenAI, and managed compute |
 | Deployment type | Standard | Applicable only for Azure OpenAI |
-| Streaming | True | Applicable for serverless API model deployments and Azure OpenAI. For models deployed via managed compute, we must set max_token = 1 to replicate streaming scenario, which allows for calculating metrics like total time to first token (TTFT) for models deployed via managed compute. |
-| Tokenizer | Tiktoken package (Azure OpenAI) | Hugging Face model ID (Azure serverless API model deployments) |
+| Streaming | True | Applicable for serverless APIs and Azure OpenAI. For models deployed via managed compute, set max_tokens = 1 to replicate the streaming scenario, which allows for calculating metrics like total time to first token (TTFT) for managed compute. |
+| Tokenizer | Tiktoken package (Azure OpenAI) <br> Hugging Face model ID (serverless APIs) | Serverless APIs and Azure OpenAI |
 
 The performance of LLMs and SLMs is assessed across the following metrics:
```
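The Streaming and TTFT rows in the diff above describe a measurement recipe that's easier to see in code. The following is a minimal sketch, not the benchmark's actual harness: it times a single streamed request against an OpenAI-compatible Azure endpoint and derives TTFT, time between tokens, and generated tokens per second (GTPS). The resource name, key, API version, deployment name, and prompt are placeholders, and streamed chunks only approximate tokens.

```python
import time

from openai import AzureOpenAI  # pip install openai

# Placeholders -- substitute your own resource, key, and deployment.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

start = time.perf_counter()
chunk_times = []  # arrival time of each content-bearing chunk

stream = client.chat.completions.create(
    model="<your-deployment>",  # placeholder deployment name
    messages=[{"role": "user", "content": "Write a short note about benchmarking."}],
    max_tokens=200,  # roughly the 200 output tokens the benchmark targets
    stream=True,     # streaming enabled, as in the Streaming parameter above
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunk_times.append(time.perf_counter())

# Assumes at least two content chunks arrived.
ttft = chunk_times[0] - start  # total time to first token (TTFT)
gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]  # time between tokens
generation_time = chunk_times[-1] - chunk_times[0]
gtps = len(chunk_times) / generation_time  # chunks/sec, an approximation of GTPS

print(f"TTFT: {ttft:.3f}s, mean gap: {sum(gaps) / len(gaps):.4f}s, GTPS: {gtps:.1f}")
```

The production benchmark aggregates these numbers over 336 trials (14 days * 24 trials); this sketch shows only a single request.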
```diff
@@ -118,23 +118,26 @@ The performance of LLMs and SLMs is assessed across the following metrics:
 | Latency TTFT | Total time to first token (TTFT) is the time taken for the first token in the response to be returned from the endpoint when streaming is enabled. |
 | Time between tokens | This metric is the time between tokens received. |
 
-Azure AI also displays the performance index as follows:
+Azure AI also displays the latency and throughput indexes as follows:
 
 | Index | Description |
 |-------|-------------|
-| Performance Index | Latency P50. Lower values are better. For performance metrics like latency or throughput, the median is often preferred as an index measurement. Because the median is more robust and less influenced by outliers, it gives a better overall sense of the typical performance and behavior of the model. |
+| Latency index | Latency TTFT (time to first token). Lower values are better. |
+| Throughput index | Throughput GTPS (generated tokens per second). Higher values are better. |
+
+Performance metrics like latency and throughput are measured over time. For these indexes, Azure AI uses TTFT and GTPS. These measurements indicate how quickly a model responds and how much data it can process.
 
 ### Cost
 
-Cost calculations are estimates for using an LLM or SLM model endpoint hosted on the Azure AI platform. Azure AI supports displaying the cost of serverless API model deployments and Azure OpenAI models. Because these costs are subject to change, we refresh our cost calculations on a cadence of TBD.
+Cost calculations are estimates for using an LLM or SLM model endpoint hosted on the Azure AI platform. Azure AI supports displaying the cost of serverless APIs and Azure OpenAI models. Because these costs are subject to change, we refresh our cost calculations twice a week.
 
 The cost of LLMs and SLMs is assessed across the following metrics:
 
 | Metric | Description |
 |--------|-------------|
-| Cost per input tokens |US Dollar value for serverless API deployment for 1 million input tokens |
-| Cost per output tokens |US Dollar value for serverless API deployment for 1 million output tokens |
-| Total price |US Dollar value for the sum of cost per input tokens and cost per output tokens, with a ratio of 3:1. |
+| Cost per input tokens | Price for serverless API deployment for 1 million input tokens |
+| Cost per output tokens | Price for serverless API deployment for 1 million output tokens |
+| Total price | Sum of the cost per input tokens and the cost per output tokens, with a 3:1 ratio of input tokens to output tokens |
 
 Azure AI also displays the cost index as follows:
```
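To make the 3:1 blend in the Total price row concrete, here's a short worked example with made-up prices. These are not real Azure rates, and this is one plausible reading of the ratio (three parts per-million input cost to one part per-million output cost); the doc doesn't spell out the normalization.

```python
# Hypothetical per-million-token prices; not real Azure rates.
input_price_per_1m = 2.00   # USD per 1M input tokens (assumed)
output_price_per_1m = 6.00  # USD per 1M output tokens (assumed)

# One plausible reading of "ratio of 3:1": weight input three times as
# heavily as output, i.e., a workload of 3M input + 1M output tokens.
total_price = 3 * input_price_per_1m + 1 * output_price_per_1m
print(total_price)  # 12.0 for the assumed prices
```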
```diff
@@ -174,7 +177,7 @@ The different indexes are calculated as follows:
 
 -**Quality index:** This index is the average score across our accuracy measurement and our GPTSimilarity measurement (brought to a scale from zero to one, using min-max scaling). The index is provided on a scale from zero to one.
 -**Performance index:** This index is the median latency across all the values that we measure.
--**Cost index:** This index is the total price in (US Dollars) for the cost per input tokens and cost per output tokens, with a ratio of 3:1.
+-**Cost index:** This index is the total price for the cost per input tokens and cost per output tokens, with a ratio of 3:1.
```
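The quality-index bullet above is compact, so a minimal sketch of the described computation follows. The scores are made up, and the 1-5 raw range assumed for GPTSimilarity is an illustration; the actual range isn't stated here.

```python
def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Map a value from [lo, hi] onto a zero-to-one scale."""
    return (value - lo) / (hi - lo)

# Made-up scores for illustration only.
accuracy = 0.82           # already on a zero-to-one scale
gpt_similarity_raw = 4.1  # assumed 1-5 judge score; real range not stated

gpt_similarity = min_max_scale(gpt_similarity_raw, lo=1.0, hi=5.0)  # ~0.775
quality_index = (accuracy + gpt_similarity) / 2                     # ~0.798

print(quality_index)  # quality index on a zero-to-one scale
```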