services: cognitive-services
manager: nitinme
ms.service: azure-ai-openai
ms.topic: how-to
ms.date: 07/23/2025
author: mrbullwinkle
ms.author: mbullwin
recommendations: false
---

# Prompt caching

Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. *"Prompt"* in this context refers to the input you send to the model as part of your chat completions request. Rather than reprocessing the same input tokens over and over again, the service retains a temporary cache of processed input token computations to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond a reduction in latency and cost. For supported models, cached tokens are billed at a [discount on input token pricing](https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/) for Standard deployment types and at up to a [100% discount on input tokens](/azure/ai-services/openai/concepts/provisioned-throughput) for Provisioned deployment types. If you provide the `user` parameter, it's combined with a prefix hash, allowing you to influence routing and improve cache hit rates. This is especially beneficial when many requests share long, common prefixes.
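
As an illustration, here's a minimal sketch of a chat completions request that sets the `user` parameter. The endpoint, key, deployment name, and prompt text are placeholders, not values from this article:

```python
from openai import AzureOpenAI

# Placeholder endpoint, key, and deployment name; substitute your own values.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com/",
    api_key="YOUR-API-KEY",
    api_version="2024-10-01-preview",
)

# Keep long, shared content (instructions, reference material) at the start of
# the prompt so repeated requests share a cacheable prefix.
shared_instructions = "You are a support assistant for Contoso. <long, static instructions>"

response = client.chat.completions.create(
    model="gpt-4o",  # your deployment name
    messages=[
        {"role": "system", "content": shared_instructions},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    # The user value is combined with the prefix hash and can influence routing,
    # improving cache hit rates when many requests share the same prefix.
    user="end-user-1234",
)
print(response.choices[0].message.content)
```
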
Caches are typically cleared within 5-10 minutes of inactivity and are always removed within one hour of the cache's last use. Prompt caches aren't shared between Azure subscriptions.
## Supported models
- Prompt caching is supported with all Azure OpenAI models that are GPT-4o or newer.
- Prompt caching applies to models that support chat completions, completions, responses, or real-time operations. The feature isn't available for models that don't support these operations.
## API support
Official support for prompt caching was first added in API version `2024-10-01-preview`. At this time, only the o-series model family supports the `cached_tokens` API response parameter.
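
As a sketch of how you might check this with the OpenAI Python client (endpoint, key, and deployment name are placeholders), the cached token count is surfaced in the usage details of the response:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com/",
    api_key="YOUR-API-KEY",
    api_version="2024-10-01-preview",
)

completion = client.chat.completions.create(
    model="o1-mini",  # placeholder deployment name for an o-series model
    messages=[{"role": "user", "content": "Hello"}],
)

usage = completion.usage
details = usage.prompt_tokens_details
# cached_tokens reports how many prompt tokens were read from the cache; the
# details object can be absent for models that don't report it.
cached = 0
if details and details.cached_tokens is not None:
    cached = details.cached_tokens
print(f"{cached} of {usage.prompt_tokens} prompt tokens were served from the cache")
```
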
## Get started
For a request to take advantage of prompt caching, the request must be both:

- A minimum of 1,024 tokens in length.
- The first 1,024 tokens in the prompt must be identical.
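
As a rough local check of the first requirement, you can count tokens with a tokenizer such as `tiktoken` (an assumption for illustration, not a tool this article prescribes); its counts approximate, but may not exactly match, the service's accounting:

```python
import tiktoken

# GPT-4o-family models use the o200k_base encoding; counts are an
# approximation of the service's own accounting.
encoding = tiktoken.encoding_for_model("gpt-4o")

def meets_caching_minimum(prompt: str, minimum: int = 1024) -> bool:
    """Check whether a prompt is long enough to be eligible for caching."""
    return len(encoding.encode(prompt)) >= minimum

print(meets_caching_minimum("<your long, static system prompt>"))
```
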
To improve the likelihood of cache hits occurring, you should structure your requests such that repetitive content occurs at the beginning of the prompt, and dynamic content occurs at the end.

## Can I disable prompt caching?
Prompt caching is enabled by default for all supported models. There's no opt-out support for prompt caching.