Commit 9f75052

Merge pull request #264079 from ChrisHMSFT/chrhoder/ptuupdate
Added some details on sizing a PTU
2 parents 4eaa70d + a153cf3 commit 9f75052

1 file changed: +12 additions, −2 deletions

articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 12 additions & 2 deletions
@@ -62,11 +62,20 @@ az cognitiveservices account deployment create \
  Provisioned throughput quota represents a specific amount of total throughput you can deploy. Quota in the Azure OpenAI Service is managed at the subscription level. All Azure OpenAI resources within the subscription share this quota.

- Quota is specific to a (deployment type, model, region) triplet and isn't interchangeable. Meaning you can't use quota for GPT-4 to deploy GPT-35-turbo. You can raise a support request to move quota across deployment types, models, or regions but the swap isn't guaranteed.
+ Quota is specified in Provisioned throughput units (PTUs) and is specific to a (deployment type, model, region) triplet. Quota isn't interchangeable: you can't use quota for GPT-4 to deploy GPT-35-turbo. You can raise a support request to move quota across deployment types, models, or regions, but the swap isn't guaranteed.
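As a minimal sketch of the quota model this change describes (hypothetical names and values, not a real Azure API), quota can be pictured as a table keyed by the exact triplet:

```python
# Toy illustration only: quota is tracked per (deployment type, model, region)
# triplet and isn't interchangeable across triplets. Names and values here are
# hypothetical, not a real Azure OpenAI API.
quota_ptus = {
    ("ProvisionedManaged", "gpt-4", "eastus"): 300,
    ("ProvisionedManaged", "gpt-35-turbo", "eastus"): 100,
}

def can_deploy(deployment_type: str, model: str, region: str, requested_ptus: int) -> bool:
    """Check quota for this exact triplet; GPT-4 quota can't cover GPT-35-turbo."""
    available = quota_ptus.get((deployment_type, model, region), 0)
    return requested_ptus <= available

print(can_deploy("ProvisionedManaged", "gpt-35-turbo", "eastus", 200))  # False
print(can_deploy("ProvisionedManaged", "gpt-4", "eastus", 200))         # True
```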
  While we make every attempt to ensure that quota is deployable, quota doesn't represent a guarantee that the underlying capacity is available. The service assigns capacity during the deployment operation and if capacity is unavailable the deployment fails with an out of capacity error.
+ ### Determining the number of PTUs needed for a workload
+
+ PTUs represent an amount of model processing capacity. Similar to your computer or databases, different workloads or requests to the model consume different amounts of underlying processing capacity. The conversion from call shape characteristics (prompt size, generation size, and call rate) to PTUs is complex and non-linear. To simplify this process, you can use the [Azure OpenAI Capacity calculator](https://oai.azure.com/portal/calculator) to size specific workload shapes.
+
+ A few high-level considerations:
+ - Generations require more capacity than prompts.
+ - Larger calls are progressively more expensive to compute. For example, 100 calls with a 1,000-token prompt size require less capacity than 1 call with 100,000 tokens in the prompt. This also means that the distribution of call shapes matters for overall throughput. Traffic patterns with a wide distribution that includes some very large calls may experience lower throughput per PTU than a narrower distribution with the same average prompt and completion token sizes (see the sketch after this hunk).
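The exact token-to-PTU conversion isn't published; the capacity calculator is the authoritative sizing tool. Purely to illustrate why call-shape distribution matters, here is a toy model with a hypothetical superlinear cost function (an assumption of this sketch, not the service's actual formula):

```python
# Purely illustrative: a hypothetical superlinear per-call cost showing why a
# wide distribution of call shapes can yield lower throughput per PTU than a
# narrow one with the same average size. This is NOT the real PTU formula;
# use the Azure OpenAI Capacity calculator for actual sizing.
def toy_call_cost(prompt_tokens: int, completion_tokens: int) -> float:
    # Completions weighted more heavily than prompts, and cost grows
    # superlinearly with total call size (both assumptions of this sketch).
    size = prompt_tokens + 3 * completion_tokens
    return size ** 1.2

# Same total and average prompt size (100,000 tokens across 100 calls),
# but two different distributions of call shapes.
narrow = [toy_call_cost(1_000, 200) for _ in range(100)]
wide = [toy_call_cost(100, 200) for _ in range(99)] + [toy_call_cost(90_100, 200)]

print(f"narrow total cost: {sum(narrow):,.0f}")  # noticeably lower total cost...
print(f"wide total cost:   {sum(wide):,.0f}")    # ...than the wide mix
```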
  ### How utilization enforcement works
  Provisioned deployments provide you with an allocated amount of model processing capacity to run a given model. The `Provisioned-Managed Utilization` metric in Azure Monitor measures a given deployment's utilization in 1-minute increments. Provisioned-Managed deployments are optimized to ensure that accepted calls are processed with a consistent model processing time (actual end-to-end latency is dependent on a call's characteristics). When the workload exceeds the allocated PTU capacity, the service returns a 429 HTTP status code until the utilization drops below 100%.
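Because the service returns 429 until utilization drops below 100%, client code typically retries with backoff. A minimal sketch, assuming the `requests` package and placeholder endpoint, deployment, key, and API version:

```python
import time
import requests  # assumes the requests package is installed

# Placeholder values: substitute your own resource endpoint, deployment, and key.
URL = "https://YOUR-RESOURCE.openai.azure.com/openai/deployments/YOUR-DEPLOYMENT/chat/completions?api-version=2024-02-01"
HEADERS = {"api-key": "YOUR-KEY", "Content-Type": "application/json"}

def call_with_backoff(payload: dict, max_retries: int = 5) -> dict:
    """Retry on 429 (provisioned deployment over 100% utilization) with exponential backoff."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(URL, headers=HEADERS, json=payload, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Honor a retry-after hint if the service provides one.
        retry_after = resp.headers.get("retry-after")
        time.sleep(float(retry_after) if retry_after else delay)
        delay *= 2
    raise RuntimeError("Still throttled after retries")
```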

@@ -94,7 +103,8 @@ We use a variation of the leaky bucket algorithm to maintain utilization below 1
  4. The overall utilization is decremented down at a continuous rate based on the number of PTUs deployed.

- Since calls are accepted until utilization reaches 100%, you're allowed to burst over 100% utilization when first increasing traffic. For sizeable calls and small sized deployments, you might then be over 100% utilization for up to several minutes.
+ > [!NOTE]
+ > Calls are accepted until utilization reaches 100%. Bursts just over 100% may be permitted for short periods, but over time, your traffic is capped at 100% utilization.

  :::image type="content" source="../media/provisioned/utilization.jpg" alt-text="Diagram showing how subsequent calls are added to the utilization." lightbox="../media/provisioned/utilization.jpg":::
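To make the enforcement mechanics concrete, here is a toy simulation of the behavior described above: utilization rises as calls are accepted, drains at a continuous rate scaled by deployed PTUs, and calls get a 429 once utilization reaches 100%. The drain rate and per-call cost are hypothetical assumptions of this sketch.

```python
# Toy simulation of the utilization enforcement described above. The specific
# drain rate and per-call cost are hypothetical; only the mechanism (accept
# until 100%, then 429 until utilization drains back below 100%) follows the doc.
class ProvisionedDeployment:
    def __init__(self, ptus: int):
        self.utilization = 0.0            # percent, nominally 0-100
        self.drain_per_tick = ptus * 0.1  # hypothetical: drain rate scales with PTUs

    def tick(self) -> None:
        # Continuous decay of utilization, modeled here in discrete steps.
        self.utilization = max(0.0, self.utilization - self.drain_per_tick)

    def try_call(self, cost: float) -> int:
        # HTTP-style status: 200 if accepted, 429 if throttled.
        if self.utilization >= 100.0:
            return 429  # rejected until utilization drains back below 100%
        self.utilization += cost  # an accepted call may burst just over 100%
        return 200

dep = ProvisionedDeployment(ptus=100)
for t in range(10):
    status = dep.try_call(cost=30.0)
    print(f"t={t} status={status} utilization={dep.utilization:.0f}%")
    dep.tick()
```

Running this shows utilization briefly bursting over 100% when traffic first ramps up, followed by alternating 429s and accepted calls as the drain rate caps sustained traffic at the deployed capacity.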
