Commit f55b5c3

update
1 parent d213900 commit f55b5c3

File tree

2 files changed: +86 -0 lines changed

articles/ai-services/openai/how-to/prompt-caching.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
---
title: 'Prompt caching with Azure OpenAI Service'
titleSuffix: Azure OpenAI
description: Learn how to use prompt caching with Azure OpenAI
services: cognitive-services
manager: nitinme
ms.service: azure-ai-openai
ms.topic: how-to
ms.date: 10/18/2024
author: mrbullwinkle
ms.author: mbullwin
recommendations: false
---

# Prompt caching

Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. *"Prompt"* in this context refers to the input you send to the model as part of your chat completions request. Rather than reprocessing the same input tokens over and over again, the model can retain a temporary cache of processed input data to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond a reduction in latency and cost.

## Supported models

Currently only the following models support prompt caching with Azure OpenAI:

- `o1-preview` (2024-09-12)
- `o1-mini` (2024-09-12)

## API support

Official support for prompt caching was first added in API version `2024-10-01-preview`.

## Getting started

For a request to take advantage of prompt caching, the request must be:

- A minimum of 1024 tokens in length.
- The first 1024 tokens in the prompt must be identical to the first 1024 tokens of a previous request.

When a match is found between a prompt and the current content of the prompt cache, it is referred to as a cache hit. Cache hits show up as [`cached_tokens`](/azure/ai-services/openai/reference-preview#cached_tokens) under [`prompt_tokens_details`](/azure/ai-services/openai/reference-preview#properties-for-prompt_tokens_details) in the chat completions response.

```json
{
    "created": 1729227448,
    "model": "o1-preview-2024-09-12",
    "object": "chat.completion",
    "service_tier": null,
    "system_fingerprint": "fp_50cdd5dc04",
    "usage": {
        "completion_tokens": 1518,
        "prompt_tokens": 1566,
        "total_tokens": 3084,
        "completion_tokens_details": {
            "audio_tokens": null,
            "reasoning_tokens": 576
        },
        "prompt_tokens_details": {
            "audio_tokens": null,
            "cached_tokens": 1408
        }
    }
}
```

After the first 1024 tokens, cache hits will occur for every 128 additional identical tokens. This is consistent with the response above, where the `cached_tokens` value of 1408 corresponds to the initial 1024 tokens plus three additional 128-token increments (1024 + 3 × 128 = 1408).

A single character difference in the first 1024 tokens will result in a cache miss, which is characterized by a `cached_tokens` value of 0. Prompt caching is enabled by default, with no additional configuration needed for supported models.
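
A minimal sketch of how you might observe cache hits from Python is below; it is not part of the original article. It assumes the `openai` package's `AzureOpenAI` client, placeholder endpoint/key environment variables and deployment name, and a shared prefix longer than 1024 tokens.

```python
# Hypothetical example: send two requests that share a long identical prefix
# and compare the cached_tokens values reported in the usage details.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],          # placeholder
    api_version="2024-10-01-preview",  # first API version with prompt caching support
)

# A long, unchanging prefix (instructions, reference text, and so on). Only an
# identical leading span of at least 1024 tokens is eligible for caching.
LONG_PREFIX = "<several thousand tokens of reference material>"  # placeholder

def ask(question: str):
    response = client.chat.completions.create(
        model="o1-preview",  # replace with your deployment name
        messages=[{"role": "user", "content": LONG_PREFIX + "\n\n" + question}],
    )
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None) if details else None
    print(f"prompt_tokens={usage.prompt_tokens}, cached_tokens={cached}")
    return response

ask("Summarize section 1.")  # first request: typically a cache miss
ask("Summarize section 2.")  # shared prefix: more likely to report cached tokens
```

Whether `prompt_tokens_details` is populated depends on your `openai` SDK version and the model, so the attribute access above is written defensively.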

## What is cached?

The o1-series models are text only and do not support system messages, images, tool use/function calling, or structured outputs. This limits the efficacy of prompt caching for these models to the user/assistant portions of the messages array, which are less likely to have an identical 1024-token prefix.

Once prompt caching is enabled for other supported models, it will expand to support:

| **Caching Supported** | **Description** |
|--------|--------|
| **Messages** | The complete messages array: system, user, and assistant content |
| **Images** | Images included in user messages, either as links or as base64-encoded data. The detail parameter must be set the same across requests. |
| **Tool use** | Both the messages array and tool definitions |
| **Structured outputs** | The structured output schema is appended as a prefix to the system message |

To improve the likelihood of cache hits occurring, structure your requests so that repetitive content occurs at the beginning of the messages array, as in the sketch below.
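
As a hypothetical illustration (for models that support system messages, which the o1-series currently do not), the sketch below keeps the large static content at the front of the messages array and the request-specific content at the end; the names and content are placeholders.

```python
# Hypothetical helper: unchanging content first, per-request content last, so
# repeated requests share the longest possible identical (cacheable) prefix.
STATIC_INSTRUCTIONS = (
    "You are a support assistant for Contoso. "
    "<several thousand tokens of unchanging policies and reference material>"
)  # placeholder content

def build_messages(user_question: str) -> list[dict]:
    return [
        # Unchanging content first: this is the portion eligible for caching.
        {"role": "system", "content": STATIC_INSTRUCTIONS},
        # Request-specific content last, so it doesn't break the shared prefix.
        {"role": "user", "content": user_question},
    ]

# Both requests share the same leading tokens; the second is more likely to
# result in a cache hit.
messages_a = build_messages("How do I reset my password?")
messages_b = build_messages("How do I change my billing address?")
```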

## Can I disable prompt caching?

Prompt caching is enabled by default. There is no opt-out option.

articles/ai-services/openai/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -130,6 +130,8 @@ items:
       href: ./how-to/completions.md
     - name: JSON mode
       href: ./how-to/json-mode.md
+    - name: Prompt caching
+      href: ./how-to/prompt-caching.md
     - name: Reproducible output
       href: ./how-to/reproducible-output.md
     - name: Structured outputs