| Estimating size | Provided calculator in the studio & benchmarking script. |
| Prompt caching | For supported models, we discount up to 100% of cached input tokens. |
## How much throughput per PTU you get for each model
For Provisioned-Managed and Global Provisioned-Managed, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic. The high-level logic is as follows:
1. Each customer has a set amount of capacity they can utilize on a deployment
2. When a request is made:

    a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100% (a client-side retry sketch follows this list).

    b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. For requests that include at least 1024 cached tokens, the cached tokens are subtracted from the prompt token value. A customer can receive up to a 100% discount on their prompt tokens depending on the size of their cached tokens. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.

3. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:
    a. If the actual > estimated, then the difference is added to the deployment's utilization
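The 429 and `retry-after-ms` behavior in step 2 maps naturally onto a client-side retry loop. The following is a minimal sketch, not official sample code: it assumes the `openai` Python package (v1.x), environment variables for the endpoint and key, and a hypothetical provisioned deployment named `my-ptu-deployment`. The comment on `max_tokens` restates the estimation logic from step 2b.

```python
import os
import time

from openai import AzureOpenAI, RateLimitError

# Endpoint, key, and deployment name below are placeholders for your own resources.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-01-preview",
)

def call_with_retry(messages: list[dict], max_retries: int = 5):
    for _ in range(max_retries):
        try:
            return client.chat.completions.create(
                model="my-ptu-deployment",  # hypothetical provisioned deployment name
                messages=messages,
                # Utilization is estimated up front as roughly
                # (prompt tokens - cached tokens) + max_tokens, so keep max_tokens
                # close to the real generation size to preserve concurrency.
                max_tokens=256,
            )
        except RateLimitError as e:
            # Utilization is above 100%; the header reports how long until it drops below 100%.
            retry_ms = e.response.headers.get("retry-after-ms")
            time.sleep(int(retry_ms) / 1000 if retry_ms else 1.0)
    raise RuntimeError("Deployment stayed above 100% utilization after retries.")
```

In practice you would likely add jitter and a cap on total wait time; the loop above only illustrates honoring `retry-after-ms`.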
`articles/ai-services/openai/how-to/prompt-caching.md` (11 additions, 4 deletions)
# Prompt caching
Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. *"Prompt"* in this context refers to the input you send to the model as part of your chat completions request. Rather than reprocessing the same input tokens over and over again, the model can retain a temporary cache of processed input data to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond a reduction in latency and cost. For supported models, cached tokens are billed at a [50% discount on input token pricing](https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/).
## Supported models
Currently only the following models support prompt caching with Azure OpenAI:
- `o1-preview-2024-09-12`
- `o1-mini-2024-09-12`
- `gpt-4o-2024-05-13`
- `gpt-4o-2024-08-06`
- `gpt-4o-mini-2024-07-18`

## API support
Official support for prompt caching was first added in API version `2024-10-01-preview`. At this time, only `o1-preview-2024-09-12` and `o1-mini-2024-09-12` models support the `cached_tokens` API response parameter.
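To see whether a request actually hit the cache, you can inspect the usage details of the chat completions response. The snippet below is a hedged sketch rather than an official sample: it assumes a recent `openai` Python package, API version `2024-10-01-preview`, and a deployment of `o1-preview-2024-09-12` whose deployment name (`o1-preview`) is a placeholder.

```python
from openai import AzureOpenAI

# Placeholder endpoint, key, and deployment name.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com",
    api_key="YOUR-API-KEY",
    api_version="2024-10-01-preview",
)

response = client.chat.completions.create(
    model="o1-preview",  # hypothetical deployment name for o1-preview-2024-09-12
    messages=[{"role": "user", "content": "A prompt longer than 1,024 tokens goes here..."}],
)

# prompt_tokens_details (and cached_tokens) may be absent for models that don't report caching.
details = getattr(response.usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", 0) if details else 0
print(f"prompt_tokens={response.usage.prompt_tokens}, cached_tokens={cached}")
```

A nonzero `cached_tokens` value indicates that the leading portion of the prompt was served from the cache.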
## Getting started
The o1-series models are text only and don't support system messages, images, tool use/function calling, or structured outputs. This limits the efficacy of prompt caching for these models to the user/assistant portions of the messages array, which are less likely to have an identical 1,024-token prefix.
For `gpt-4o` and `gpt-4o-mini` models, prompt caching is supported for:

|**Caching Supported**|**Description**|
|--------|--------|
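Because caching depends on repeated requests sharing an identical token prefix of at least 1,024 tokens, it helps to keep content that never changes at the front of the messages array and put per-request content at the end. The sketch below is illustrative only; the static instructions block and the helper name are assumptions, not part of this article.

```python
# Content that is identical across requests goes first so repeated calls share
# the same token prefix; the user-specific question goes last.
STATIC_INSTRUCTIONS = "...a long, unchanging block of instructions and reference text..."

def build_messages(user_question: str) -> list[dict]:
    return [
        # The messages array (including system content) is part of the prompt for
        # gpt-4o models; for o1-series models, which don't support system messages,
        # the static text would instead lead the first user message.
        {"role": "system", "content": STATIC_INSTRUCTIONS},
        {"role": "user", "content": user_question},
    ]
```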
## Can I disable prompt caching?
Prompt caching is enabled by default. There is no opt-out option.
## How does prompt caching work for Provisioned deployments?
For supported models on provisioned deployments, we discount up to 100% of cached input tokens. For more information, see our [Provisioned Throughput documentation](/azure/ai-services/openai/concepts/provisioned-throughput).