Commit c5e0a01

Added a section on throughput
1 parent c500e9c commit c5e0a01

File tree

1 file changed (+18, -0 lines)

articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 18 additions & 0 deletions
@@ -43,6 +43,24 @@ An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model.
> [!NOTE]
> The provisioned version of `gpt-4` **Version:** `turbo-2024-04-09` is currently limited to text only.
## How much throughput you get for each model

The amount of throughput (tokens per minute, or TPM) a deployment gets per PTU is a function of the input and output tokens being processed.

Generating output tokens requires more processing than reading input tokens, so the more output tokens a workload generates, the lower its overall TPM per PTU. Provisioned deployments dynamically balance the two, so users do not have to set specific input and output limits, and the service stays resilient to fluctuations in workload shape.

To simplify the sizing effort, the following table outlines the TPM per PTU for the `gpt-4o` and `gpt-4o-mini` models.

| | **gpt-4o**, **2024-05-13** | **gpt-4o**, **2024-08-06** | **gpt-4o-mini**, **2024-07-18** |
|--|--|--|--|
| Deployable Increments | 50 | 50 | 25 |
| Input TPM per PTU | 2,500 | 2,500 | 37,000 |
| Output TPM per PTU | 833 | 833 | 12,333 |
| Latency target | > 25 tokens per second* | > 25 tokens per second* | > 25 tokens per second* |

\* Calculated as the average of the per-call average generated tokens on a 1-minute basis over the month

\** For a full list please see the [AOAI Studio calculator](https://oai.azure.com/portal/calculator)
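
To make the sizing arithmetic concrete, here is a minimal sketch in Python. It assumes that input and output token consumption draw down a PTU's capacity linearly and additively against the per-PTU rates in the table above, and that deployments round up to the model's deployable increment. These are illustrative assumptions rather than the service's actual accounting, so rely on the AOAI Studio calculator for real estimates.

```python
import math

# Per-PTU throughput for gpt-4o (2024-08-06), taken from the table above.
INPUT_TPM_PER_PTU = 2_500
OUTPUT_TPM_PER_PTU = 833
DEPLOYABLE_INCREMENT = 50  # gpt-4o PTUs are deployed in increments of 50

def estimate_ptus(input_tpm: float, output_tpm: float) -> int:
    """Rough PTU estimate for a workload shape, assuming input and output
    consumption add linearly against the per-PTU rates (an illustrative
    simplification, not the service's actual accounting)."""
    raw_ptus = input_tpm / INPUT_TPM_PER_PTU + output_tpm / OUTPUT_TPM_PER_PTU
    # PTUs can only be deployed in fixed increments, so round up.
    return math.ceil(raw_ptus / DEPLOYABLE_INCREMENT) * DEPLOYABLE_INCREMENT

# Example: a workload averaging 600K input and 60K output tokens per minute.
print(estimate_ptus(600_000, 60_000))  # 240 + ~72 raw PTUs -> 350 deployable
```

Note how output weighs on the estimate: each output token consumes roughly three times the PTU capacity of an input token for `gpt-4o` (2,500 / 833 ≈ 3), which is why output-heavy workloads see lower overall TPM per PTU.
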
## Key concepts

### Deployment types
