Commit 9a49755

Merge pull request #966 from sydneemayers/docs-editor/provisioned-throughput-1729624524
updated prompt caching documentation
2 parents: 6f577f7 + 0ea3339

File tree

2 files changed: +16 −8 lines changed

articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 5 additions & 4 deletions
@@ -35,6 +35,7 @@ An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model.
  | Latency | Max latency constrained from the model. Overall latency is a factor of call shape. |
  | Utilization | Provisioned-managed Utilization V2 measure provided in Azure Monitor. |
  | Estimating size | Provided calculator in the studio & benchmarking script. |
+ | Prompt caching | For supported models, we discount up to 100% of cached input tokens. |


  ## How much throughput per PTU you get for each model
@@ -153,12 +154,12 @@ In the Provisioned-Managed and Global Provisioned-Managed offerings, each request
  For Provisioned-Managed and Global Provisioned-Managed, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic. The high-level logic is as follows:

  1. Each customer has a set amount of capacity they can utilize on a deployment
- 2. When a request is made:
+ 1. When a request is made:

-    a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%
-
-    b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
+    a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%

+    b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. For requests that include at least 1024 cached tokens, the cached tokens are subtracted from the prompt token value. A customer can receive up to a 100% discount on their prompt tokens depending on the size of their cached tokens. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
+
  3. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:

     a. If the actual > estimated, then the difference is added to the deployment's utilization
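
Editor's note: the updated text above describes an estimate-then-correct accounting flow, now with cached tokens subtracted from the prompt token value. The following minimal Python sketch restates that flow for readers skimming the diff. All function names, the capacity/percentage units, and the fallback generation estimate are assumptions for illustration, not the service's actual implementation.

```python
# Illustrative sketch only; the service's real accounting is not public.

def estimate_request_cost(prompt_tokens: int, cached_tokens: int,
                          max_tokens: int | None) -> int:
    """Estimate capacity needed before the request runs (step b above)."""
    billable_prompt = prompt_tokens
    if cached_tokens >= 1024:
        # Cached tokens are subtracted from the prompt token value,
        # giving up to a 100% discount on prompt tokens.
        billable_prompt = max(prompt_tokens - cached_tokens, 0)
    # If max_tokens is not specified, the service estimates a value;
    # 1000 here is an arbitrary placeholder for that estimate.
    generation_estimate = max_tokens if max_tokens is not None else 1000
    return billable_prompt + generation_estimate


def admit_request(current_utilization_pct: float, estimated_cost: int,
                  capacity: int) -> tuple[str, float]:
    """Admit or reject a request (steps a/b above)."""
    if current_utilization_pct > 100:
        # The real service returns HTTP 429 with `retry-after-ms` set to the
        # time until utilization falls below 100%.
        return "429", current_utilization_pct
    return "accepted", current_utilization_pct + 100 * estimated_cost / capacity


def settle_request(current_utilization_pct: float, estimated_cost: int,
                   actual_cost: int, capacity: int) -> float:
    """Correct utilization once the actual compute cost is known (step 3a)."""
    if actual_cost > estimated_cost:
        current_utilization_pct += 100 * (actual_cost - estimated_cost) / capacity
    # The hunk shown here is cut off before the branch for actual < estimated.
    return current_utilization_pct


# Example: 3,000 prompt tokens of which 2,048 are cached, max_tokens=500
cost = estimate_request_cost(prompt_tokens=3000, cached_tokens=2048, max_tokens=500)
print(cost)  # (3000 - 2048) + 500 = 1452
```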

articles/ai-services/openai/how-to/prompt-caching.md

Lines changed: 11 additions & 4 deletions
@@ -14,18 +14,21 @@ recommendations: false

  # Prompt caching

- Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. *"Prompt"* in this context is referring to the input you send to the model as part of your chat completions request. Rather than reprocess the same input tokens over and over again, the model is able to retain a temporary cache of processed input data to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond a reduction in latency and cost.
+ Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. *"Prompt"* in this context is referring to the input you send to the model as part of your chat completions request. Rather than reprocess the same input tokens over and over again, the model is able to retain a temporary cache of processed input data to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond a reduction in latency and cost. For supported models, cached tokens are billed at a [50% discount on input token pricing](https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/).

  ## Supported models

  Currently only the following models support prompt caching with Azure OpenAI:

  - `o1-preview-2024-09-12`
  - `o1-mini-2024-09-12`
+ - `gpt-4o-2024-05-13`
+ - `gpt-4o-2024-08-06`
+ - `gpt-4o-mini-2024-07-18`

  ## API support

- Official support for prompt caching was first added in API version `2024-10-01-preview`.
+ Official support for prompt caching was first added in API version `2024-10-01-preview`. At this time, only `o1-preview-2024-09-12` and `o1-mini-2024-09-12` models support the `cached_tokens` API response parameter.

  ## Getting started

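Editor's note: as a companion to the `cached_tokens` change above, here is a hedged sketch of inspecting cache hits from a chat completions response. The environment variable names, the `o1-mini` deployment name, and the placeholder static prefix are assumptions, and the `prompt_tokens_details` attribute requires a recent release of the `openai` Python package.

```python
import os
from openai import AzureOpenAI  # requires a recent `openai` package release

# Placeholder environment variable names and deployment name; substitute your own.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-01-preview",  # first API version with prompt caching support
)

# Cache hits require an identical prefix of at least 1,024 tokens, so keep the
# long, unchanging content first and append the variable part at the end.
static_instructions = "<at least 1,024 tokens of unchanging instructions or context>"
user_question = "What changed in this release?"

response = client.chat.completions.create(
    model="o1-mini",  # your deployment name for a supported model
    messages=[
        # o1-series models don't support system messages, so the static prefix
        # goes at the start of the first user message here.
        {"role": "user", "content": static_instructions + "\n\n" + user_question},
    ],
)

usage = response.usage
# `prompt_tokens_details.cached_tokens` is reported for the o1 models in this
# API version; guard the attribute access for older SDK versions.
details = getattr(usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", 0) if details else 0
print(f"prompt tokens: {usage.prompt_tokens}, cached tokens: {cached}")
```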
@@ -67,7 +70,7 @@ A single character difference in the first 1,024 tokens will result in a cache miss

  The o1-series models are text only and don't support system messages, images, tool use/function calling, or structured outputs. This limits the efficacy of prompt caching for these models to the user/assistant portions of the messages array which are less likely to have an identical 1024 token prefix.

- Once prompt caching is enabled for other supported models prompt caching will expand to support:
+ For `gpt-4o` and `gpt-4o-mini` models, prompt caching is supported for:

  | **Caching Supported** | **Description** |
  |--------|--------|
@@ -80,4 +83,8 @@ To improve the likelihood of cache hits occurring, you should structure your requests

  ## Can I disable prompt caching?

- Prompt caching is enabled by default. There is no opt-out option.
+ Prompt caching is enabled by default. There is no opt-out option.
+
+ ## How does prompt caching work for Provisioned deployments?
+
+ For supported models on provisioned deployments, we discount up to 100% of cached input tokens. For more information, see our [Provisioned Throughput documentation](/azure/ai-services/openai/concepts/provisioned-throughput).
