Skip to content

Commit 7660708

Browse files
authored
Merge pull request #897 from mrbullwinkle/mrb_10_18_2024_prompt_caching
[Azure OpenAI] Prompt caching
2 parents 51e549e + 0715732 commit 7660708

File tree

2 files changed

+85
-0
lines changed

2 files changed

+85
-0
lines changed
Lines changed: 83 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,83 @@
1+
---
2+
title: 'Prompt caching with Azure OpenAI Service'
3+
titleSuffix: Azure OpenAI
4+
description: Learn how to use prompt caching with Azure OpenAI
5+
services: cognitive-services
6+
manager: nitinme
7+
ms.service: azure-ai-openai
8+
ms.topic: how-to
9+
ms.date: 10/18/2024
10+
author: mrbullwinkle
11+
ms.author: mbullwin
12+
recommendations: false
13+
---
14+
15+
# Prompt caching
16+
17+
Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. *"Prompt"* in this context is referring to the input you send to the model as part of your chat completions request. Rather than reprocess the same input tokens over and over again, the model is able to retain a temporary cache of processed input data to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond a reduction in latency and cost.
18+
19+
## Supported models
20+
21+
Currently only the following models support prompt caching with Azure OpenAI:
22+
23+
- `o1-preview-2024-09-12`
24+
- `o1-mini-2024-09-12`
25+
26+
## API support
27+
28+
Official support for prompt caching was first added in API version `2024-10-01-preview`.
29+
30+
## Getting started
31+
32+
For a request to take advantage of prompt caching the request must be both:
33+
34+
- A minimum of 1,024 tokens in length.
35+
- The first 1,024 tokens in the prompt must be identical.
36+
37+
When a match is found between a prompt and the current content of the prompt cache, it's referred to as a cache hit. Cache hits will show up as [`cached_tokens`](/azure/ai-services/openai/reference-preview#cached_tokens) under [`prompt_token_details`](/azure/ai-services/openai/reference-preview#properties-for-prompt_tokens_details) in the chat completions response.
38+
39+
```json
40+
{
41+
"created": 1729227448,
42+
"model": "o1-preview-2024-09-12",
43+
"object": "chat.completion",
44+
"service_tier": null,
45+
"system_fingerprint": "fp_50cdd5dc04",
46+
"usage": {
47+
"completion_tokens": 1518,
48+
"prompt_tokens": 1566,
49+
"total_tokens": 3084,
50+
"completion_tokens_details": {
51+
"audio_tokens": null,
52+
"reasoning_tokens": 576
53+
},
54+
"prompt_tokens_details": {
55+
"audio_tokens": null,
56+
"cached_tokens": 1408
57+
}
58+
}
59+
}
60+
```
61+
62+
After the first 1,024 tokens cache hits will occur for every 128 additional identical tokens.
63+
64+
A single character difference in the first 1,024 tokens will result in a cache miss which is characterized by a `cached_tokens` value of 0. Prompt caching is enabled by default with no additional configuration needed for supported models.
65+
66+
## What is cached?
67+
68+
The o1-series models are text only and don't support system messages, images, tool use/function calling, or structured outputs. This limits the efficacy of prompt caching for these models to the user/assistant portions of the messages array which are less likely to have an identical 1024 token prefix.
69+
70+
Once prompt caching is enabled for other supported models prompt caching will expand to support:
71+
72+
| **Caching Supported** | **Description** |
73+
|--------|--------|
74+
|**Messages** | The complete messages array: system, user, and assistant content |
75+
|**Images** | Images included in user messages, both as links or as base64-encoded data. The detail parameter must be set the same across requests.
76+
|**Tool use**| Both the messages array and tool definitions |
77+
|**Structured outputs** | Structured output schema is appended as a prefix to the system message|
78+
79+
To improve the likelihood of cache hits occurring, you should structure your requests such that repetitive content occurs at the beginning of the messages array.
80+
81+
## Can I disable prompt caching?
82+
83+
Prompt caching is enabled by default. There is no opt-out option.

articles/ai-services/openai/toc.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -130,6 +130,8 @@ items:
130130
href: ./how-to/completions.md
131131
- name: JSON mode
132132
href: ./how-to/json-mode.md
133+
- name: Prompt caching
134+
href: ./how-to/prompt-caching.md
133135
- name: Reproducible output
134136
href: ./how-to/reproducible-output.md
135137
- name: Structured outputs

0 commit comments

Comments
 (0)