---
title: 'Prompt caching with Azure OpenAI Service'
titleSuffix: Azure OpenAI
description: Learn how to use prompt caching with Azure OpenAI
services: cognitive-services
manager: nitinme
ms.service: azure-ai-openai
ms.topic: how-to
ms.date: 10/18/2024
author: mrbullwinkle
ms.author: mbullwin
recommendations: false
---

# Prompt caching

Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. *"Prompt"* in this context refers to the input you send to the model as part of your chat completions request. Rather than reprocessing the same input tokens over and over again, the model retains a temporary cache of processed input data to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond a reduction in latency and cost.

## Supported models

Currently only the following models support prompt caching with Azure OpenAI:

- `o1-preview` (2024-09-12)
- `o1-mini` (2024-09-12)

## API support

Official support for prompt caching was first added in API version `2024-10-01-preview`.
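
For example, with the OpenAI Python library you pass this API version when constructing the client. The following is a minimal sketch; the environment variable names are common conventions rather than requirements:

```python
import os

from openai import AzureOpenAI

# Prompt caching requires API version 2024-10-01-preview or later.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # assumed variable name
    api_key=os.environ["AZURE_OPENAI_API_KEY"],          # assumed variable name
    api_version="2024-10-01-preview",
)
```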

## Getting started

For a request to take advantage of prompt caching, the request must:

- Be a minimum of 1024 tokens in length.
- Have its first 1024 tokens be identical to those of a previous request.

When a match is found between a prompt and the current content of the prompt cache, it's referred to as a cache hit. Cache hits show up as [`cached_tokens`](/azure/ai-services/openai/reference-preview#cached_tokens) under [`prompt_tokens_details`](/azure/ai-services/openai/reference-preview#properties-for-prompt_tokens_details) in the chat completions response.

```json
{
  "created": 1729227448,
  "model": "o1-preview-2024-09-12",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": "fp_50cdd5dc04",
  "usage": {
    "completion_tokens": 1518,
    "prompt_tokens": 1566,
    "total_tokens": 3084,
    "completion_tokens_details": {
      "audio_tokens": null,
      "reasoning_tokens": 576
    },
    "prompt_tokens_details": {
      "audio_tokens": null,
      "cached_tokens": 1408
    }
  }
}
```

After the first 1024 tokens, cache hits occur for every 128 additional identical tokens. In the example response above, 1408 of the 1566 prompt tokens were served from the cache: the initial 1024-token prefix plus three additional 128-token increments (1024 + 3 × 128 = 1408).

A single character difference in the first 1024 tokens results in a cache miss, which is characterized by a `cached_tokens` value of 0. Prompt caching is enabled by default with no additional configuration needed for supported models.
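
As a rough way to observe this behavior, you can send the same long prompt twice and compare the `cached_tokens` values. The sketch below assumes the client created earlier, a deployment named `o1-mini` (substitute your own deployment name), and a recent version of the `openai` Python package that surfaces `prompt_tokens_details`:

```python
# Repeat a fixed passage so the prompt comfortably exceeds 1024 tokens.
long_prompt = (
    "Summarize the incident response runbook step by step. " * 150
    + "What should the on-call engineer do first?"
)

for attempt in (1, 2):
    response = client.chat.completions.create(
        model="o1-mini",  # your deployment name (assumption)
        messages=[{"role": "user", "content": long_prompt}],
    )
    details = response.usage.prompt_tokens_details
    cached = details.cached_tokens if details else 0
    print(f"Attempt {attempt}: cached_tokens = {cached}")

# Expected pattern: the first request reports cached_tokens = 0 (cache miss),
# while the second reports 1024 or more if it's routed to a warm cache.
```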

## What is cached?

The o1-series models are text only and don't support system messages, images, tool use/function calling, or structured outputs. This limits the efficacy of prompt caching for these models to the user/assistant portions of the messages array, which are less likely to have an identical 1024-token prefix.

Once prompt caching is enabled for other supported models, prompt caching expands to support:

| **Caching Supported** | **Description** |
|--------|--------|
| **Messages** | The complete messages array: system, user, and assistant content |
| **Images** | Images included in user messages, both as links and as base64-encoded data. The `detail` parameter must be set the same across requests. |
| **Tool use** | Both the messages array and tool definitions |
| **Structured outputs** | The structured output schema is appended as a prefix to the system message |

To improve the likelihood of cache hits, structure your requests so that repetitive content occurs at the beginning of the messages array, as in the sketch that follows.
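
A minimal sketch of that principle, assuming a model where system messages and tool definitions are cached: keep the long, static content identical at the front of every request and append the variable, per-request content last. The policy text and helper function below are hypothetical:

```python
# Static instructions: identical across requests, so they can form the
# shared 1024-token prefix that the cache matches on.
STATIC_INSTRUCTIONS = (
    "You are a support assistant for Contoso. Apply the following policies. "
    # ...several hundred tokens of fixed policy text...
)

def build_messages(user_question: str) -> list[dict]:
    """Put unchanging content first; variable content goes last."""
    return [
        {"role": "system", "content": STATIC_INSTRUCTIONS},
        {"role": "user", "content": user_question},  # varies per request
    ]
```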

## Can I disable prompt caching?

Prompt caching is enabled by default. There is no opt-out option.