
Commit bfff5c3

Merge pull request #6205 from mrbullwinkle/mrb_07_25_2025_prompt_caching
[Azure OpenAI] Prompt caching user update
2 parents: 1253c9a + ab50fb5 · commit bfff5c3

File tree

1 file changed (+7, -3 lines)


articles/ai-foundry/openai/how-to/prompt-caching.md

Lines changed: 7 additions & 3 deletions
@@ -6,15 +6,15 @@ services: cognitive-services
 manager: nitinme
 ms.service: azure-ai-openai
 ms.topic: how-to
-ms.date: 07/23/2025
+ms.date: 07/24/2025
 author: mrbullwinkle
 ms.author: mbullwin
 recommendations: false
 ---

 # Prompt caching

-Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. *"Prompt"* in this context is referring to the input you send to the model as part of your chat completions request. Rather than reprocess the same input tokens over and over again, the service is able to retain a temporary cache of processed input token computations to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond a reduction in latency and cost. For supported models, cached tokens are billed at a [discount on input token pricing](https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/) for Standard deployment types and up to [100% discount on input tokens](/azure/ai-services/openai/concepts/provisioned-throughput) for Provisioned deployment types. If you provide the `user` parameter, it's combined with a prefix hash, allowing you to influence routing and improve cache hit rates. This is especially beneficial when many requests share long, common prefixes.
+Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. *"Prompt"* in this context is referring to the input you send to the model as part of your chat completions request. Rather than reprocess the same input tokens over and over again, the service is able to retain a temporary cache of processed input token computations to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond a reduction in latency and cost. For supported models, cached tokens are billed at a [discount on input token pricing](https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/) for Standard deployment types and up to [100% discount on input tokens](/azure/ai-services/openai/concepts/provisioned-throughput) for Provisioned deployment types.

 Caches are typically cleared within 5-10 minutes of inactivity and are always removed within one hour of the cache's last use. Prompt caches aren't shared between Azure subscriptions.

@@ -27,13 +27,15 @@ Caches are typically cleared within 5-10 minutes of inactivity and are always re

 Official support for prompt caching was first added in API version `2024-10-01-preview`. At this time, only the o-series model family supports the `cached_tokens` API response parameter.

-## Get started
+## Getting started

 For a request to take advantage of prompt caching the request must be both:

 - A minimum of 1,024 tokens in length.
 - The first 1,024 tokens in the prompt must be identical.

+Requests are routed based on a hash of the initial prefix of a prompt.
+
 When a match is found between the token computations in a prompt and the current content of the prompt cache, it's referred to as a cache hit. Cache hits will show up as [`cached_tokens`](/azure/ai-services/openai/reference-preview#cached_tokens) under [`prompt_tokens_details`](/azure/ai-services/openai/reference-preview#properties-for-prompt_tokens_details) in the chat completions response.

 ```json
@@ -63,6 +65,8 @@ After the first 1,024 tokens cache hits will occur for every 128 additional iden

 A single character difference in the first 1,024 tokens will result in a cache miss which is characterized by a `cached_tokens` value of 0. Prompt caching is enabled by default with no additional configuration needed for supported models.

+If you provide the [`user`](/azure/ai-foundry/openai/reference-preview-latest#request-body-2) parameter, it's combined with the prefix hash, allowing you to influence routing and improve cache hit rates. This is especially beneficial when many requests share long, common prefixes.
+
 ## What is cached?

 o1-series models feature support varies by model. For more information, see our dedicated [reasoning models guide](./reasoning.md).
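
For context on the `cached_tokens` field this diff refers to, the sketch below shows roughly where a cache hit surfaces in the `usage` object of a chat completions response; the token counts are illustrative placeholders, not values taken from the article's own example.

```json
{
  "usage": {
    "prompt_tokens": 1557,
    "completion_tokens": 524,
    "total_tokens": 2081,
    "prompt_tokens_details": {
      "cached_tokens": 1024,
      "audio_tokens": 0
    }
  }
}
```

A `cached_tokens` value of 0 indicates a cache miss; any positive value indicates a hit.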

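To illustrate the `user` parameter behavior this PR documents, here is a minimal Python sketch using the `openai` SDK against an Azure OpenAI deployment. The endpoint, deployment name, and prompt contents are placeholder assumptions; the API version is the earliest one the article cites as supporting `cached_tokens`.

```python
import os

from openai import AzureOpenAI  # requires openai>=1.x

# Placeholder endpoint and deployment; substitute your own resource values.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-01-preview",
)

# Assume this shared prefix is identical across requests and at least 1,024 tokens long,
# since that's the minimum the article gives for prompt caching to apply.
long_shared_prefix = "<your long, shared system prompt goes here>"

response = client.chat.completions.create(
    model="YOUR-DEPLOYMENT-NAME",  # e.g., an o-series deployment
    user="customer-1234",          # combined with the prefix hash to influence routing
    messages=[
        {"role": "system", "content": long_shared_prefix},
        {"role": "user", "content": "Summarize the policy above in three bullet points."},
    ],
)

# A nonzero cached_tokens value indicates a prompt cache hit.
details = response.usage.prompt_tokens_details
print("cached tokens:", details.cached_tokens if details else 0)
```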