
Commit 3fe8447

Merge pull request #2352 from MicrosoftDocs/main
1/16/2025 AM Publish
2 parents 438b0a1 + 79b7fe5 commit 3fe8447

35 files changed (+567 -316 lines)

articles/ai-services/computer-vision/overview-identity.md

Lines changed: 3 additions & 2 deletions
@@ -33,6 +33,8 @@ Or, you can try out the capabilities of Face service quickly and easily in your
 
 [!INCLUDE [Gate notice](./includes/identity-gate-notice.md)]
 
+[!INCLUDE [GDPR-related guidance](./includes/identity-data-notice.md)]
+
 
 This documentation contains the following types of articles:
 * The [quickstarts](./quickstarts-sdk/identity-client-library.md) are step-by-step instructions that let you make calls to the service and get results in a short period of time.
@@ -58,7 +60,7 @@ The following are common use cases for the Face service:
 See the [customer checkin management](https://github.com/Azure-Samples/azure-ai-vision/tree/main/face/Scenario-CustomerCheckinManagement) and [face photo tagging](https://github.com/Azure-Samples/azure-ai-vision/tree/main/face/Scenario-FacePhotoTagging) scenarios on GitHub for working examples of facial recognition technology.
 
 > [!WARNING]
-> On June 11, 2020, Microsoft announced that it will not sell facial recognition technology to police departments in the United States until strong regulation, grounded in human rights, has been enacted. As such, customers may not use facial recognition features or functionality included in Azure Services, such as Face or Video Indexer, if a customer is, or is allowing use of such services by or for, a police department in the United States. When you create a new Face resource, you must acknowledge and agree in the Azure Portal that you will not use the service by or for a police department in the United States and that you have reviewed the Responsible AI documentation and will use this service in accordance with it.
+> On June 11, 2020, Microsoft announced that it will not sell facial recognition technology to police departments in the United States until strong regulation, grounded in human rights, has been enacted. As such, customers may not use facial recognition features or functionality included in Azure Services, such as Face or Video Indexer, if a customer is, or is allowing use of such services by or for, a police department in the United States. When you create a new Face resource, you must acknowledge and agree in the Azure portal that you will not use the service by or for a police department in the United States and that you have reviewed the Responsible AI documentation and will use this service in accordance with it.
 
 
 ## Face detection and analysis

@@ -101,7 +103,6 @@ Face liveness SDK reference docs:
 
 Modern enterprises and apps can use the Face recognition technologies, including Face verification ("one-to-one" matching) and Face identification ("one-to-many" matching) to confirm that a user is who they claim to be.
 
-[!INCLUDE [GDPR-related guidance](./includes/identity-data-notice.md)]
 
 ### Identification
 
articles/ai-services/computer-vision/whats-new.md

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ These Image Analysis 4.0 preview APIs will be retired on March 31, 2025:
 - `2023-07-01-preview`
 - `v4.0-preview.1`
 
-These features will no longer be available with the retirement of the preview API versions:
+The following features will no longer be available upon retirement of the preview API versions, and they are removed from the Studio experience as of January 10, 2025:
 - Model customization
 - Background removal
 - Product recognition

articles/ai-services/openai/concepts/model-retirements.md

Lines changed: 1 addition & 0 deletions
@@ -107,6 +107,7 @@ These models are currently available for use in Azure OpenAI Service.
 | `gpt-4` | vision-preview | To be upgraded to `gpt-4` version: `turbo-2024-04-09`, starting no sooner than January 27, 2025 **<sup>1</sup>** | `gpt-4o`|
 | `gpt-4o` | 2024-05-13 | No earlier than May 20, 2025 <br><br>Deployments set to [**Auto-update to default**](/azure/ai-services/openai/how-to/working-with-models?tabs=powershell#auto-update-to-default) will be automatically upgraded to version: `2024-08-06`, starting on February 13, 2025. | |
 | `gpt-4o-mini` | 2024-07-18 | No earlier than July 18, 2025 | |
+| `gpt-4o-realtime-preview` | 2024-10-01 | No earlier than September 30, 2025 | `gpt-4o-realtime-preview` (version 2024-12-17) |
 | `gpt-3.5-turbo-instruct` | 0914 | No earlier than April 1, 2025 | |
 | `o1` | 2024-12-17 | No earlier than December 17, 2025 | |
 | `text-embedding-ada-002` | 2 | No earlier than October 3, 2025 | `text-embedding-3-small` or `text-embedding-3-large` |

articles/ai-services/openai/concepts/models.md

Lines changed: 5 additions & 4 deletions
@@ -58,17 +58,18 @@ To learn more about the advanced `o1` series models see, [getting started with o
 
 ## GPT-4o-Realtime-Preview
 
-The `gpt-4o-realtime-preview` model is part of the GPT-4o model family and supports low-latency, "speech in, speech out" conversational interactions. GPT-4o audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user.
+The GPT-4o audio models are part of the GPT-4o model family and support low-latency, "speech in, speech out" conversational interactions. GPT-4o audio is designed to handle real-time, low-latency conversational interactions, making it a great fit for support agents, assistants, translators, and other use cases that need highly responsive back-and-forth with a user.
 
 GPT-4o audio is available in the East US 2 (`eastus2`) and Sweden Central (`swedencentral`) regions. To use GPT-4o audio, you need to [create](../how-to/create-resource.md) or use an existing resource in one of the supported regions.
 
-When your resource is created, you can [deploy](../how-to/create-resource.md#deploy-a-model) the GPT-4o audio model. If you are performing a programmatic deployment, the **model** name is `gpt-4o-realtime-preview`. For more information on how to use GPT-4o audio, see the [GPT-4o audio documentation](../realtime-audio-quickstart.md).
+When your resource is created, you can [deploy](../how-to/create-resource.md#deploy-a-model) the GPT-4o audio model. For more information on how to use GPT-4o audio, see the [GPT-4o audio quickstart](../realtime-audio-quickstart.md) and [how to use GPT-4o audio](../how-to/realtime-audio.md).
 
 Details about maximum request tokens and training data are available in the following table.
 
 | Model ID | Description | Max Request (tokens) | Training Data (up to) |
-| --- | :--- |:--- |:---: |
-|`gpt-4o-realtime-preview` (2024-10-01-preview) <br> **GPT-4o audio** | **Audio model** for real-time audio processing |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
+|---|---|---|---|
+|`gpt-4o-realtime-preview` (2024-10-01) <br> **GPT-4o audio** | **Audio model** for real-time audio processing |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
+|`gpt-4o-realtime-preview` (2024-12-17) <br> **GPT-4o audio** | **Audio model** for real-time audio processing |Input: 128,000 <br> Output: 4,096 | Oct 2023 |
 
 ## GPT-4o and GPT-4 Turbo
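For a programmatic deployment of the `gpt-4o-realtime-preview` model, a minimal sketch using the `azure-mgmt-cognitiveservices` management SDK might look like the following. The subscription ID, resource names, deployment name, and the `GlobalStandard` SKU and capacity are placeholder assumptions, not prescribed values; check the deployment guide for the SKUs and versions available to your resource.

```python
# Sketch: deploy gpt-4o-realtime-preview programmatically.
# Names, SKU, and capacity below are illustrative assumptions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment, DeploymentModel, DeploymentProperties, Sku,
)

client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# Long-running operation; poller.result() blocks until the deployment exists.
poller = client.deployments.begin_create_or_update(
    resource_group_name="<resource-group>",
    account_name="<azure-openai-resource>",
    deployment_name="my-realtime-deployment",
    deployment=Deployment(
        sku=Sku(name="GlobalStandard", capacity=1),
        properties=DeploymentProperties(
            model=DeploymentModel(
                format="OpenAI",
                name="gpt-4o-realtime-preview",
                version="2024-12-17",
            ),
        ),
    ),
)
print(poller.result().properties.provisioning_state)
```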

articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 40 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -30,35 +30,34 @@ An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model.
3030

3131
| Topic | Provisioned|
3232
|---|---|
33-
| What is it? | Provides guaranteed throughput at smaller increments than the existing provisioned offer. Deployments have a consistent max latency for a given model-version. |
33+
| What is it? |Provides guaranteed throughput at smaller increments than the existing provisioned offer. Deployments have a consistent max latency for a given model-version. |
3434
| Who is it for? | Customers who want guaranteed throughput with minimal latency variance. |
3535
| Quota |Provisioned Managed Throughput Unit, Global Provisioned Managed Throughput Unit, or Data Zone Provisioned Managed Throughput Unit assigned per region. Quota can be used across any available Azure OpenAI model.|
3636
| Latency | Max latency constrained from the model. Overall latency is a factor of call shape. |
3737
| Utilization | Provisioned-managed Utilization V2 measure provided in Azure Monitor. |
38-
| Estimating size | Provided calculator in Azure AI Foundry & benchmarking script. |
38+
|Estimating size |Provided sizing calculator in Azure AI Foundry.|
3939
|Prompt caching | For supported models, we discount up to 100% of cached input tokens. |
4040

4141

4242
## How much throughput per PTU you get for each model
43-
The amount of throughput (tokens per minute or TPM) a deployment gets per PTU is a function of the input and output tokens in the minute. Generating output tokens requires more processing than input tokens and so the more output tokens generated the lower your overall TPM. The service dynamically balances the input & output costs, so users do not have to set specific input and output limits. This approach means your deployment is resilient to fluctuations in the workload shape.
43+
The amount of throughput (tokens per minute or TPM) a deployment gets per PTU is a function of the input and output tokens in the minute. Generating output tokens requires more processing than input tokens. For the models specified in the table below, 1 output token counts as 3 input tokens towards your TPM per PTU limit. The service dynamically balances the input & output costs, so users do not have to set specific input and output limits. This approach means your deployment is resilient to fluctuations in the workload shape.
4444

45-
To help with simplifying the sizing effort, the following table outlines the TPM per PTU for the `gpt-4o` and `gpt-4o-mini` models which represent the max TPM assuming all traffic is either input or output. To understand how different ratios of input and output tokens impact your Max TPM per PTU, see the [Azure OpenAI capacity calculator](https://oai.azure.com/portal/calculator). The table also shows Service Level Agreement (SLA) Latency Target Values per model. For more information about the SLA for Azure OpenAI Service, see the [Service Level Agreements (SLA) for Online Services page](https://www.microsoft.com/licensing/docs/view/Service-Level-Agreements-SLA-for-Online-Services?lang=1)
45+
To help with simplifying the sizing effort, the following table outlines the TPM per PTU for the specified models. To understand the impact of output tokens on the TPM per PTU limit, use the 3 input token to 1 output token ratio. For a detailed understanding of how different ratios of input and output tokens impact the throughput your workload needs, see the [Azure OpenAI capacity calculator](https://oai.azure.com/portal/calculator). The table also shows Service Level Agreement (SLA) Latency Target Values per model. For more information about the SLA for Azure OpenAI Service, see the [Service Level Agreements (SLA) for Online Services page](https://www.microsoft.com/licensing/docs/view/Service-Level-Agreements-SLA-for-Online-Services?lang=1)
4646

47-
|Topic| **gpt-4o**, **2024-05-13** & **gpt-4o**, **2024-08-06** | **gpt-4o-mini**, **2024-07-18** |
47+
|Topic| **gpt-4o** | **gpt-4o-mini** |
4848
| --- | --- | --- |
4949
|Global & data zone provisioned minimum deployment|15|15|
5050
|Global & data zone provisioned scale increment|5|5|
5151
|Regional provisioned minimum deployment | 50 | 25|
5252
|Regional provisioned scale increment|50|25|
53-
|Max Input TPM per PTU | 2,500 | 37,000 |
54-
|Max Output TPM per PTU| 833|12,333|
53+
|Input TPM per PTU | 2,500 | 37,000 |
5554
|Latency Target Value |25 Tokens Per Second|33 Tokens Per Second|
5655

5756
For a full list see the [Azure OpenAI Service in Azure AI Foundry portal calculator](https://oai.azure.com/portal/calculator).
5857

5958

6059
> [!NOTE]
61-
> Global provisioned deployments are only supported for gpt-4o, 2024-08-06 and gpt-4o-mini, 2024-07-18 models at this time. Data zone provisioned deployments are only supported for gpt-4o, 2024-08-06, gpt-4o, 2024-05-13, and gpt-4o-mini, 2024-07-18 models at this time. For more information on model availability, review the [models documentation](./models.md).
60+
> Global provisioned and data zone provisioned deployments are only supported for gpt-4o and gpt-4o-mini models at this time. For more information on model availability, review the [models documentation](./models.md).
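As a rough worked example of the 3:1 output-token weighting and the deployment increments above (a sketch only; the workload numbers are hypothetical, and the capacity calculator remains the authoritative sizing tool):

```python
import math

# Illustrative sizing sketch for a gpt-4o workload using the 3:1 weighting.
INPUT_TPM_PER_PTU = 2_500   # gpt-4o figure from the table above
OUTPUT_TOKEN_WEIGHT = 3     # 1 output token counts as 3 input tokens

def ptus_needed(input_tpm: int, output_tpm: int) -> float:
    """Approximate PTUs for a workload, before rounding to deployment increments."""
    effective_input_tpm = input_tpm + OUTPUT_TOKEN_WEIGHT * output_tpm
    return effective_input_tpm / INPUT_TPM_PER_PTU

# Example: 60 requests/min, each with 1,000 input tokens and 300 output tokens.
ptus = ptus_needed(input_tpm=60 * 1_000, output_tpm=60 * 300)
print(f"~{ptus:.1f} PTUs before rounding")  # ~45.6

# Round up to the global provisioned minimum (15) and scale increment (5).
deployable = max(15, 15 + 5 * math.ceil((ptus - 15) / 5))
print(deployable)  # 50
```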
 
 ## Key concepts

@@ -73,11 +72,11 @@ az cognitiveservices account deployment create \
 --name <myResourceName> \
 --resource-group <myResourceGroupName> \
 --deployment-name MyDeployment \
---model-name gpt-4 \
---model-version 0613 \
+--model-name gpt-4o \
+--model-version 2024-08-06 \
 --model-format OpenAI \
---sku-capacity 100 \
---sku-name ProvisionedManaged
+--sku-capacity 15 \
+--sku-name GlobalProvisionedManaged
 ```
 
 ### Quota
@@ -132,7 +131,7 @@ If an acceptable region isn't available to support the desired model, version and
 
 ### Determining the number of PTUs needed for a workload
 
-PTUs represent an amount of model processing capacity. Similar to your computer or databases, different workloads or requests to the model will consume different amounts of underlying processing capacity. The conversion from call shape characteristics (prompt size, generation size and call rate) to PTUs is complex and nonlinear. To simplify this process, you can use the [Azure OpenAI Capacity calculator](https://oai.azure.com/portal/calculator) to size specific workload shapes.
+PTUs represent an amount of model processing capacity. Similar to your computer or databases, different workloads or requests to the model will consume different amounts of underlying processing capacity. The conversion from throughput needs to PTUs can be approximated using historical token usage data or call shape estimations (input tokens, output tokens, and requests per minute) as outlined in our [performance and latency](../how-to/latency.md) documentation. To simplify this process, you can use the [Azure OpenAI Capacity calculator](https://oai.azure.com/portal/calculator) to size specific workload shapes.
 
 A few high-level considerations:
 - Generations require more capacity than prompts
@@ -165,16 +164,16 @@ For provisioned deployments, we use a variation of the leaky bucket algorithm to
 1. When a request is made:
 
    a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%
 
-   b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining prompt tokens and the specified `max_tokens` in the call. For requests that include at least 1024 cached tokens, the cached tokens are subtracted from the prompt token value. A customer can receive up to a 100% discount on their prompt tokens depending on the size of their cached tokens. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
+   b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining the prompt tokens, less any cached tokens, and the specified `max_tokens` in the call. A customer can receive up to a 100% discount on their prompt tokens depending on the size of their cached tokens. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.
 
 1. When a request finishes, we now know the actual compute cost for the call. To ensure an accurate accounting, we correct the utilization using the following logic:
 
    a. If the actual > estimated, then the difference is added to the deployment's utilization.
 
    b. If the actual < estimated, then the difference is subtracted.
 
 1. The overall utilization is decremented down at a continuous rate based on the number of PTUs deployed.
 
 > [!NOTE]
 > Calls are accepted until utilization reaches 100%. Bursts just over 100% may be permitted in short periods, but over time, your traffic is capped at 100% utilization.
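The following is a minimal sketch of this accounting, assuming a simple single-bucket model; the class name, drain formula, and admission rule are illustrative and not the service's actual implementation:

```python
import time

class ProvisionedUtilization:
    """Toy model of the leaky-bucket utilization accounting described above."""

    def __init__(self, ptus: int, tpm_per_ptu: int):
        self.capacity_tpm = ptus * tpm_per_ptu  # tokens drained per minute
        self.bucket = 0.0                        # outstanding estimated tokens
        self.last_drain = time.monotonic()

    def _drain(self) -> None:
        # Step 3: utilization is decremented continuously, scaled by PTUs.
        now = time.monotonic()
        elapsed_min = (now - self.last_drain) / 60
        self.bucket = max(0.0, self.bucket - self.capacity_tpm * elapsed_min)
        self.last_drain = now

    def try_admit(self, prompt_tokens: int, cached_tokens: int, max_tokens: int) -> bool:
        """Step 1: admit, or reject (429) when utilization is at/above 100%."""
        self._drain()
        if self.bucket >= self.capacity_tpm:
            return False  # caller should honor the retry-after-ms header
        # Step 1b: estimate cost as prompt tokens (less cached) plus max_tokens.
        self.bucket += (prompt_tokens - cached_tokens) + max_tokens
        return True

    def settle(self, estimated: int, actual: int) -> None:
        """Step 2: correct the estimate once the true generation size is known."""
        self.bucket += actual - estimated  # adds if underestimated, subtracts if over
```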
@@ -184,12 +183,30 @@ For provisioned deployments, we use a variation of the leaky bucket algorithm to
 
 #### How many concurrent calls can I have on my deployment?
 
-The number of concurrent calls you can achieve depends on each call's shape (prompt size, max_token parameter, etc.). The service continues to accept calls until the utilization reach 100%. To determine the approximate number of concurrent calls, you can model out the maximum requests per minute for a particular call shape in the [capacity calculator](https://oai.azure.com/portal/calculator). If the system generates less than the number of samplings tokens like max_token, it will accept more requests.
+The number of concurrent calls you can achieve depends on each call's shape (prompt size, `max_tokens` parameter, etc.). The service continues to accept calls until the utilization reaches 100%. To determine the approximate number of concurrent calls, you can model out the maximum requests per minute for a particular call shape in the [capacity calculator](https://oai.azure.com/portal/calculator). If the system generates less than the number of output tokens set for the `max_tokens` parameter, then the provisioned deployment will accept more requests.
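As a back-of-the-envelope illustration (hypothetical numbers; it also assumes the 3:1 output-token weighting applies to the admission estimate, which is a simplification):

```python
# Sketch: rough concurrency estimate for one call shape on a provisioned deployment.
PTUS = 100
INPUT_TPM_PER_PTU = 2_500   # gpt-4o figure from the sizing table above
OUTPUT_TOKEN_WEIGHT = 3

prompt_tokens, max_tokens = 1_000, 500
cost_per_call = prompt_tokens + OUTPUT_TOKEN_WEIGHT * max_tokens  # 2,500 weighted tokens
budget_tpm = PTUS * INPUT_TPM_PER_PTU                             # 250,000 weighted tokens/min

max_rpm = budget_tpm / cost_per_call          # ~100 requests per minute
avg_latency_sec = 10                          # assumed end-to-end call duration
concurrent = max_rpm * avg_latency_sec / 60   # ~16.7 calls in flight
print(round(max_rpm), round(concurrent, 1))
```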

 ## What models and regions are available for provisioned throughput?
 
+# [Global Provisioned Managed](#tab/global-ptum)
+
+### Global provisioned managed model availability
+
+[!INCLUDE [Provisioned Managed Global](../includes/model-matrix/provisioned-global.md)]
+
+# [Data Zone Provisioned Managed](#tab/datazone-provisioned-managed)
+
+### Data zone provisioned managed model availability
+
+[!INCLUDE [Global data zone provisioned managed](../includes/model-matrix/datazone-provisioned-managed.md)]
+
+# [Provisioned Managed](#tab/provisioned)
+
+### Provisioned deployment model availability
+
 [!INCLUDE [Provisioned](../includes/model-matrix/provisioned-models.md)]
 
+---
+
 > [!NOTE]
 > The provisioned version of `gpt-4` **Version:** `turbo-2024-04-09` is currently limited to text only.
