Commit f8a58ce

Merge pull request #997 from MicrosoftDocs/main

10/23/2024 PM Publish

2 parents 93fac9e + 879ab51

File tree

21 files changed (+149, -103 lines)

articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 5 additions & 4 deletions
@@ -35,6 +35,7 @@ An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model.
  | Latency | Max latency constrained from the model. Overall latency is a factor of call shape. |
  | Utilization | Provisioned-managed Utilization V2 measure provided in Azure Monitor. |
  | Estimating size | Provided calculator in the studio & benchmarking script. |
+ | Prompt caching | For supported models, we discount up to 100% of cached input tokens. |

  ## How much throughput per PTU you get for each model
@@ -153,12 +154,12 @@ In the Provisioned-Managed and Global Provisioned-Managed offerings, each reques
  For Provisioned-Managed and Global Provisioned-Managed, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic. The high-level logic is as follows:

  1. Each customer has a set amount of capacity they can utilize on a deployment
- 2. When a request is made:
+ 1. When a request is made:

-    a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%
-
-    b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
+    a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%

+    b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. For requests that include at least 1024 cached tokens, the cached tokens are subtracted from the prompt token value. A customer can receive up to a 100% discount on their prompt tokens depending on the size of their cached tokens. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
+
  3. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:

     a. If the actual > estimated, then the difference is added to the deployment's utilization
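To make the utilization logic in this hunk concrete, here is a minimal Python sketch of the bookkeeping it describes. This is an illustration of the documented behavior only; the class, the capacity units, and the fixed retry-after value are assumptions, not the service's actual implementation.

```python
# Illustrative sketch of the utilization accounting described above.
# Capacity units and the retry-after value are placeholder assumptions.

class ProvisionedDeployment:
    def __init__(self, capacity: int):
        self.capacity = capacity  # total capacity granted by the deployment's PTUs
        self.used = 0.0           # estimated cost of requests currently in flight

    def utilization(self) -> float:
        return self.used / self.capacity

    def try_admit(self, prompt_tokens: int, cached_tokens: int, max_tokens: int):
        """Admit a request, or reject it 429-style when utilization is above 100%."""
        if self.utilization() > 1.0:
            retry_after_ms = 1000  # placeholder for the time until utilization < 100%
            return ("429", retry_after_ms)
        # Cached tokens are subtracted from the prompt cost (applies when the
        # request includes at least 1,024 cached tokens), up to a 100% discount.
        billable_prompt = prompt_tokens - cached_tokens if cached_tokens >= 1024 else prompt_tokens
        estimate = billable_prompt + max_tokens  # max_tokens stands in for output size
        self.used += estimate
        return ("accepted", estimate)

    def settle(self, estimate: float, actual: float):
        """Correct utilization once the request's actual compute cost is known."""
        self.used += actual - estimate  # adds when actual > estimated, refunds otherwise
```

Note how an overstated `max_tokens` inflates the estimate until `settle` refunds the difference, which is why the doc recommends keeping `max_tokens` close to the true generation size.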

articles/ai-services/openai/how-to/prompt-caching.md

Lines changed: 11 additions & 4 deletions
@@ -14,18 +14,21 @@ recommendations: false
  # Prompt caching

- Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. *"Prompt"* in this context is referring to the input you send to the model as part of your chat completions request. Rather than reprocess the same input tokens over and over again, the model is able to retain a temporary cache of processed input data to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond a reduction in latency and cost.
+ Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. *"Prompt"* in this context is referring to the input you send to the model as part of your chat completions request. Rather than reprocess the same input tokens over and over again, the model is able to retain a temporary cache of processed input data to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond a reduction in latency and cost. For supported models, cached tokens are billed at a [50% discount on input token pricing](https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/).

  ## Supported models

  Currently only the following models support prompt caching with Azure OpenAI:

  - `o1-preview-2024-09-12`
  - `o1-mini-2024-09-12`
+ - `gpt-4o-2024-05-13`
+ - `gpt-4o-2024-08-06`
+ - `gpt-4o-mini-2024-07-18`

  ## API support

- Official support for prompt caching was first added in API version `2024-10-01-preview`.
+ Official support for prompt caching was first added in API version `2024-10-01-preview`. At this time, only `o1-preview-2024-09-12` and `o1-mini-2024-09-12` models support the `cached_tokens` API response parameter.

  ## Getting started

@@ -67,7 +70,7 @@ A single character difference in the first 1,024 tokens will result in a cache m
  The o1-series models are text only and don't support system messages, images, tool use/function calling, or structured outputs. This limits the efficacy of prompt caching for these models to the user/assistant portions of the messages array which are less likely to have an identical 1024 token prefix.

- Once prompt caching is enabled for other supported models prompt caching will expand to support:
+ For `gpt-4o` and `gpt-4o-mini` models, prompt caching is supported for:

  | **Caching Supported** | **Description** |
  |--------|--------|
@@ -80,4 +83,8 @@ To improve the likelihood of cache hits occurring, you should structure your req
  ## Can I disable prompt caching?

- Prompt caching is enabled by default. There is no opt-out option.
+ Prompt caching is enabled by default. There is no opt-out option.
+
+ ## How does prompt caching work for Provisioned deployments?
+
+ For supported models on provisioned deployments, we discount up to 100% of cached input tokens. For more information, see our [Provisioned Throughput documentation](/azure/ai-services/openai/concepts/provisioned-throughput).
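To see caching in effect, you can read the `cached_tokens` value from the usage details of a chat completions response. A minimal sketch using the OpenAI Python SDK against an Azure endpoint; the endpoint, key, and deployment name are placeholders, and the field path follows the SDK as of API version `2024-10-01-preview`:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com/",  # placeholder
    api_key="YOUR-API-KEY",                                    # placeholder
    api_version="2024-10-01-preview",  # first API version with prompt caching support
)

# Cache hits require an identical prefix of at least 1,024 tokens, so keep the
# long, shared part of the prompt at the beginning and vary only the tail.
response = client.chat.completions.create(
    model="o1-preview",  # placeholder deployment name
    messages=[{"role": "user", "content": "<same 1,024+ token prefix>" + " <your question>"}],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens
print(f"{cached} of {usage.prompt_tokens} prompt tokens were read from the cache")
```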

articles/ai-services/openai/quotas-limits.md

Lines changed: 4 additions & 4 deletions
@@ -10,7 +10,7 @@ ms.custom:
  - ignite-2023
  - references_regions
  ms.topic: conceptual
- ms.date: 10/11/2024
+ ms.date: 10/23/2024
  ms.author: mbullwin
  ---

@@ -132,14 +132,14 @@ The Usage Limit determines the level of usage above which customers might see la
  |Model| Usage Tiers per month |
  |----|----|
- |`gpt-4o` | 8 Billion tokens |
- |`gpt-4o-mini` | 45 Billion tokens |
+ |`gpt-4o` | 12 Billion tokens |
+ |`gpt-4o-mini` | 85 Billion tokens |

  #### GPT-4 standard

  |Model| Usage Tiers per month|
  |---|---|
- | `gpt-4` + `gpt-4-32k` (all versions) | 4 Billion |
+ | `gpt-4` + `gpt-4-32k` (all versions) | 6 Billion |

  ## Other offer types

articles/ai-services/speech-service/includes/quickstarts/captioning/intro.md

Lines changed: 4 additions & 1 deletion
@@ -9,4 +9,7 @@ ms.author: eur
  In this quickstart, you run a console app to create [captions](~/articles/ai-services/speech-service/captioning-concepts.md) with speech to text.

  > [!TIP]
- > Try the [Speech Studio](https://aka.ms/speechstudio/captioning) and choose a sample video clip to see real-time or offline processed captioning results.
+ > Try out the [Speech Studio](https://aka.ms/speechstudio/captioning) and choose a sample video clip to see real-time or offline processed captioning results.
+
+ > [!TIP]
+ > Try out the [Azure AI Speech Toolkit](https://marketplace.visualstudio.com/items?itemName=ms-azureaispeech.azure-ai-speech-toolkit) to easily build and run captioning samples on Visual Studio Code.

articles/ai-services/speech-service/includes/quickstarts/speech-to-text-basics/cpp.md

Lines changed: 3 additions & 0 deletions
@@ -24,6 +24,9 @@ The Speech SDK is available as a [NuGet package](https://www.nuget.org/packages/
  ## Recognize speech from a microphone

+ > [!TIP]
+ > Try out the [Azure AI Speech Toolkit](https://marketplace.visualstudio.com/items?itemName=ms-azureaispeech.azure-ai-speech-toolkit) to easily build and run samples on Visual Studio Code.
+
  Follow these steps to create a console application and install the Speech SDK.

  1. Create a new C++ console project in [Visual Studio Community](https://visualstudio.microsoft.com/downloads/) named `SpeechRecognition`.

articles/ai-services/speech-service/includes/quickstarts/speech-to-text-basics/csharp.md

Lines changed: 3 additions & 0 deletions
@@ -24,6 +24,9 @@ The Speech SDK is available as a [NuGet package](https://www.nuget.org/packages/
  ## Recognize speech from a microphone

+ > [!TIP]
+ > Try out the [Azure AI Speech Toolkit](https://marketplace.visualstudio.com/items?itemName=ms-azureaispeech.azure-ai-speech-toolkit) to easily build and run samples on Visual Studio Code.
+
  Follow these steps to create a console application and install the Speech SDK.

  1. Open a command prompt window in the folder where you want the new project. Run this command to create a console application with the .NET CLI.

articles/ai-services/speech-service/includes/quickstarts/speech-to-text-basics/javascript.md

Lines changed: 3 additions & 0 deletions
@@ -26,6 +26,9 @@ To set up your environment, install the Speech SDK for JavaScript. Run this comm
  ## Recognize speech from a file

+ > [!TIP]
+ > Try out the [Azure AI Speech Toolkit](https://marketplace.visualstudio.com/items?itemName=ms-azureaispeech.azure-ai-speech-toolkit) to easily build and run samples on Visual Studio Code.
+
  Follow these steps to create a Node.js console application for speech recognition.

  1. Open a command prompt window where you want the new project, and create a new file named *SpeechRecognition.js*.

articles/ai-services/speech-service/includes/quickstarts/speech-to-text-basics/python.md

Lines changed: 3 additions & 0 deletions
@@ -29,6 +29,9 @@ Install a version of [Python from 3.7 or later](https://www.python.org/downloads
  ## Recognize speech from a microphone

+ > [!TIP]
+ > Try out the [Azure AI Speech Toolkit](https://marketplace.visualstudio.com/items?itemName=ms-azureaispeech.azure-ai-speech-toolkit) to easily build and run samples on Visual Studio Code.
+
  Follow these steps to create a console application.

  1. Open a command prompt window in the folder where you want the new project. Create a new file named *speech_recognition.py*.
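For orientation, the console app these steps produce boils down to a few lines of the Speech SDK for Python. A minimal sketch; the `SPEECH_KEY` and `SPEECH_REGION` environment variable names are placeholders for your own resource's key and region:

```python
import os
import azure.cognitiveservices.speech as speechsdk

# Placeholder environment variables holding your Speech resource key and region.
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"],
)
speech_config.speech_recognition_language = "en-US"

# Use the default microphone as the audio source.
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

print("Speak into your microphone.")
result = recognizer.recognize_once_async().get()  # recognizes a single utterance
print("Recognized:", result.text)
```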

articles/ai-services/speech-service/includes/quickstarts/speech-translation-basics/intro.md

Lines changed: 3 additions & 0 deletions
@@ -7,3 +7,6 @@ ms.author: eur
  ---

  In this quickstart, you run an application to translate speech from one language to text in another language.
+
+ > [!TIP]
+ > Try out the [Azure AI Speech Toolkit](https://marketplace.visualstudio.com/items?itemName=ms-azureaispeech.azure-ai-speech-toolkit) to easily build and run samples on Visual Studio Code.
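The translation quickstart follows the same shape as speech recognition. A minimal Python sketch under the same assumptions (placeholder `SPEECH_KEY`/`SPEECH_REGION` environment variables; Italian chosen as an example target language):

```python
import os
import azure.cognitiveservices.speech as speechsdk

translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["SPEECH_KEY"],  # placeholder
    region=os.environ["SPEECH_REGION"],     # placeholder
)
translation_config.speech_recognition_language = "en-US"  # the spoken language
translation_config.add_target_language("it")              # translate to Italian

audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config, audio_config=audio_config
)

print("Speak into your microphone.")
result = recognizer.recognize_once_async().get()
print("Recognized:", result.text)
print("Translated:", result.translations["it"])
```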

articles/ai-services/speech-service/includes/quickstarts/text-to-speech-basics/intro.md

Lines changed: 3 additions & 0 deletions
@@ -10,3 +10,6 @@ With Azure AI Speech, you can run an application that synthesizes a human-like v
  > [!TIP]
  > You can try text to speech in the [Speech Studio Voice Gallery](https://aka.ms/speechstudio/voicegallery) without signing up or writing any code.
+
+ > [!TIP]
+ > Try out the [Azure AI Speech Toolkit](https://marketplace.visualstudio.com/items?itemName=ms-azureaispeech.azure-ai-speech-toolkit) to easily build and run samples on Visual Studio Code.
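The text to speech quickstart reduces to a similar sketch (same placeholder environment variables; the voice name is just one example from the gallery):

```python
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],  # placeholder
    region=os.environ["SPEECH_REGION"],     # placeholder
)
speech_config.speech_synthesis_voice_name = "en-US-AvaMultilingualNeural"  # example voice

# Play the synthesized audio through the default speaker.
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

result = synthesizer.speak_text_async("I'm excited to try text to speech!").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized.")
```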
