Commit a06436b

Learn Editor: Update latency.md
1 parent 95bdb11 commit a06436b

File tree

1 file changed: +6 −6 lines changed


articles/ai-services/openai/how-to/latency.md

Lines changed: 6 additions & 6 deletions
@@ -61,15 +61,15 @@ Assuming all requests for a given workload are uniform, the prompt tokens and co
 ##### How to use system level throughput estimates
 
 
-Once system level throughput has been estimated for a given workload, these estimates can be used to size Standard and Provisioned deployments.
+Once system level throughput has been estimated for a given workload, these estimates can be used to size Standard and Provisioned deployments. For Standard deployments, the input and output TPM values can be combined to estimate the total TPM to be assigned to a given deployment. For Provisioned deployments, the request token usage data (for the dedicated capacity calculator experience) or input and output TPM values (for the deployment capacity calculator experience) can be used to estimate the number of PTUs required to support a given workload.
 
 Here are a few examples for GPT-4o mini model:
 
-| Prompt Size (tokens) | Generation size (tokens) | Requests per minute | PTUs required |
-|--|--|--|--|
-| 800 | 150 | 30 | 100 |
-| 1000 | 50 | 300 | 700 |
-| 5000 | 100 | 50 | 600 |
+| Prompt Size (tokens) | Generation size (tokens) | Requests per minute | Input TPM | Output TPM | PTUs required |
+|--|--|--|--|--|--|
+| 800 | 150 | 30 | 24,000 | 4,500 | 15 |
+| 5,000 | 50 | 1,000 | 5,000,000 | 50,000 | 140 |
+| 1,000 | 300 | 500 | 500,000 | 150,000 | 30 |
 
 The number of PTUs scales roughly linearly with call rate (might be sublinear) when the workload distribution remains constant.
 
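The Input TPM and Output TPM columns added in this commit follow directly from the per-request token sizes and the request rate (input TPM = prompt tokens × requests per minute, output TPM = generation tokens × requests per minute). A minimal sketch of that arithmetic, assuming a uniform workload; the helper name is hypothetical, and the PTU counts in the table come from the capacity calculator, not from this formula:

```python
def throughput_tpm(prompt_tokens: int, generation_tokens: int,
                   requests_per_minute: int) -> tuple[int, int]:
    """Estimate input/output tokens-per-minute for a uniform workload."""
    input_tpm = prompt_tokens * requests_per_minute
    output_tpm = generation_tokens * requests_per_minute
    return input_tpm, output_tpm

# Rows from the updated table in this commit:
print(throughput_tpm(800, 150, 30))      # (24000, 4500)
print(throughput_tpm(5_000, 50, 1_000))  # (5000000, 50000)
print(throughput_tpm(1_000, 300, 500))   # (500000, 150000)
```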
