
Commit 5158dd6

Merge pull request #275739 from mrbullwinkle/mrb_05_19_2024_global_endpoints
[Azure OpenAI] Global standard
2 parents 165c978 + a31150f commit 5158dd6

File tree

8 files changed: +139 additions, −6 deletions

articles/ai-services/openai/how-to/create-resource.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ manager: nitinme
ms.service: azure-ai-openai
ms.custom: devx-track-azurecli, build-2023, build-2023-dataai, devx-track-azurepowershell
ms.topic: how-to
- ms.date: 08/25/2023
+ ms.date: 05/20/2024
zone_pivot_groups: openai-create-resource
author: mrbullwinkle
ms.author: mbullwin
articles/ai-services/openai/how-to/deployment-types.md (new file; path per the toc.yml entry below)

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
---
title: Understanding Azure OpenAI Service deployment types
titleSuffix: Azure AI services
description: Learn how to use Azure OpenAI deployment types | Global-Standard | Standard | Provisioned.
#services: cognitive-services
author: mrbullwinkle
manager: nitinme
ms.service: azure-ai-openai
ms.topic: how-to
ms.date: 05/19/2024
ms.author: mbullwin
---

# Azure OpenAI deployment types

Azure OpenAI provides customers with choices on the hosting structure that fits their business and usage patterns. The service offers two main types of deployment: **standard** and **provisioned**. Standard is offered with a global deployment option, routing traffic globally to provide higher throughput. All deployments can perform the same inference operations; however, billing, scale, and performance differ substantially. As part of your solution design, you need to make two key decisions:

- **Data residency needs**: global vs. regional resources
- **Call volume**: standard vs. provisioned

## Global versus regional deployment types

For standard deployments you have a choice of two configurations within your resource: **global** or **regional**. Global standard is the recommended starting point for development and experimentation. Global deployments use Azure's global infrastructure to dynamically route customer traffic to the data center with the best availability for each inference request. Global deployments offer higher initial throughput limits, though your latency may vary at high usage levels. For customers that require lower latency variance at large workloads, we recommend purchasing provisioned throughput.

Our global deployments will be the first location for all new models and features. Customers with very large throughput requirements should consider our provisioned deployment offering.
## Deployment types

Azure OpenAI offers three types of deployments, each providing a different level of capability with trade-offs in throughput, SLAs, and price. Below is a summary of the options followed by a deeper description of each.

| **Offering** | **Global-Standard** <sup>**1**</sup> | **Standard** | **Provisioned** |
|---|---|---|---|
| **Best suited for** | Applications that don't require data residency. Recommended starting place for customers. | For customers with data residency requirements. Optimized for low to medium volume. | Real-time scoring for large consistent volume. Includes the highest commitments and limits. |
| **How it works** | Traffic may be routed anywhere in the world | | |
| **Getting started** | [Model deployment](./create-resource.md) | [Model deployment](./create-resource.md) | [Provisioned onboarding](./provisioned-throughput-onboarding.md) |
| **Cost** | [Baseline](https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/) | [Regional pricing](https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/) | May experience cost savings for consistent usage |
| **What you get** | Easy access to all new models with highest default pay-per-call limits.<br><br>Customers with high volume usage may see higher latency variability. | Easy access with [SLA on availability](https://azure.microsoft.com/support/legal/sla/). Optimized for low to medium volume workloads with high burstiness.<br><br>Customers with high consistent volume may experience greater latency variability. | Regional access with very high and predictable throughput. Determine throughput per PTU using the provided [capacity calculator](./provisioned-throughput-onboarding.md#estimate-provisioned-throughput-and-cost). |
| **What you don't get** | Data residency guarantees | High volume with consistent low latency | Pay-per-call flexibility |
| **Per-call latency** | Optimized for real-time calling and low to medium volume usage. Customers with high volume usage may see higher latency variability. Threshold set per model. | Optimized for real-time calling and low to medium volume usage. Customers with high volume usage may see higher latency variability. Threshold set per model. | Optimized for real-time. |
| **SKU name in code** | `GlobalStandard` | `Standard` | `ProvisionedManaged` |
| **Billing model** | Pay-per-token | Pay-per-token | Monthly commitments |

<sup>**1**</sup> The Global-Standard deployment type is currently in preview.
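The SKU names in the table above are the values you pass when creating a deployment programmatically. As a minimal illustration, the helper below is hypothetical (not part of any SDK) and simply maps the offerings to their in-code SKU names:

```python
# Hypothetical helper: map the deployment offerings in the table above to
# the SKU names used in code when creating a deployment.
SKU_NAMES = {
    "global-standard": "GlobalStandard",
    "standard": "Standard",
    "provisioned": "ProvisionedManaged",
}

def sku_name(offering: str) -> str:
    """Return the in-code SKU name for a deployment offering."""
    try:
        return SKU_NAMES[offering.lower()]
    except KeyError:
        raise ValueError(f"Unknown deployment offering: {offering!r}")

print(sku_name("global-standard"))  # GlobalStandard
```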

## Provisioned

Provisioned deployments allow you to specify the amount of throughput you require in a deployment. The service then allocates the necessary model processing capacity and ensures it's ready for you. Throughput is defined in terms of provisioned throughput units (PTU), which is a normalized way of representing the throughput for your deployment. Each model-version pair requires a different number of PTUs to deploy and provides a different amount of throughput per PTU. Learn more from our [Provisioned throughput concepts article](../concepts/provisioned-throughput.md).
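As a back-of-the-envelope sketch of PTU sizing: the throughput-per-PTU figure below is a made-up placeholder, not a real model value — use the capacity calculator linked above for actual numbers.

```python
import math

def ptus_needed(required_tpm: int, tpm_per_ptu: int, min_ptus: int = 1) -> int:
    """Estimate PTUs needed for a target tokens-per-minute (TPM) load.

    tpm_per_ptu varies by model-version pair; the value used in the
    example call below is illustrative only.
    """
    return max(min_ptus, math.ceil(required_tpm / tpm_per_ptu))

# Example: a 250,000 TPM workload against a hypothetical 2,500 TPM per PTU.
print(ptus_needed(250_000, 2_500))  # 100
```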

## Standard

Standard deployments provide a pay-per-call billing model on the chosen model. This option provides the fastest way to get started, as you only pay for what you consume. Models available in each region, as well as throughput, may be limited.

Standard deployments are optimized for low to medium volume workloads with high burstiness. Customers with high consistent volume may experience greater latency variability.

## Global standard (preview)

Global deployments are available in the same Azure OpenAI resources as non-global offers but allow you to use Azure's global infrastructure to dynamically route traffic to the data center with the best availability for each request. Global standard provides the highest default quota for new models and eliminates the need to load balance across multiple resources.

This deployment type is optimized for low to medium volume workloads with high burstiness. Customers with high consistent volume may experience greater latency variability. The threshold is set per model. See the [quota page to learn more](./quota.md).

For customers that require lower latency variance at large workloads, we recommend purchasing provisioned throughput.

### How to disable access to global deployments in your subscription

Azure Policy helps to enforce organizational standards and to assess compliance at scale. Through its compliance dashboard, it provides an aggregated view to evaluate the overall state of the environment, with the ability to drill down to per-resource, per-policy granularity. It also helps to bring your resources to compliance through bulk remediation for existing resources and automatic remediation for new resources. [Learn more about Azure Policy and specific built-in controls for AI services](/azure/ai-services/security-controls-policy).

You can use the following policy to disable access to Azure OpenAI global standard deployments.

```json
{
    "mode": "All",
    "policyRule": {
        "if": {
            "allOf": [
                {
                    "field": "type",
                    "equals": "Microsoft.CognitiveServices/accounts/deployments"
                },
                {
                    "field": "Microsoft.CognitiveServices/accounts/deployments/sku.name",
                    "equals": "GlobalStandard"
                }
            ]
        },
        "then": {
            "effect": "deny"
        }
    }
}
```

The `deny` effect in the `then` block is what blocks creation of deployments that match both conditions; without a `then` block, a policy rule has no effect.
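To see what the rule matches, here's a minimal local sketch — not an Azure Policy engine, just a check of the two `allOf` conditions against hypothetical resource descriptions:

```python
# The two `allOf` conditions from the policy rule above.
CONDITIONS = [
    {"field": "type",
     "equals": "Microsoft.CognitiveServices/accounts/deployments"},
    {"field": "Microsoft.CognitiveServices/accounts/deployments/sku.name",
     "equals": "GlobalStandard"},
]

def matches(resource: dict) -> bool:
    """True when every condition's field equals the resource's value."""
    return all(resource.get(c["field"]) == c["equals"] for c in CONDITIONS)

# Hypothetical deployment descriptions, keyed by the policy field aliases.
global_deployment = {
    "type": "Microsoft.CognitiveServices/accounts/deployments",
    "Microsoft.CognitiveServices/accounts/deployments/sku.name": "GlobalStandard",
}
regional_deployment = {
    "type": "Microsoft.CognitiveServices/accounts/deployments",
    "Microsoft.CognitiveServices/accounts/deployments/sku.name": "Standard",
}

print(matches(global_deployment), matches(regional_deployment))  # True False
```

Only the `GlobalStandard` deployment is matched (and therefore denied); `Standard` and `ProvisionedManaged` deployments are unaffected.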

## Deploy models

:::image type="content" source="../media/deployment-types/deploy-models.png" alt-text="Screenshot that shows the model deployment dialog in Azure OpenAI Studio with three deployment types highlighted." lightbox="../media/deployment-types/deploy-models.png":::

To learn about creating resources and deploying models, refer to the [resource creation guide](./create-resource.md).

## See also

- [Quotas & limits](./quota.md)
- [Provisioned throughput units (PTU) onboarding](./provisioned-throughput-onboarding.md)
- [Provisioned throughput units (PTU) getting started](./provisioned-get-started.md)

articles/ai-services/openai/includes/create-resource-cli.md

Lines changed: 5 additions & 2 deletions
@@ -7,7 +7,7 @@ manager: nitinme
ms.service: azure-ai-openai
ms.custom: devx-track-azurecli
ms.topic: include
- ms.date: 08/25/2023
+ ms.date: 05/20/2024
---

## Prerequisites

@@ -80,7 +80,7 @@ az cognitiveservices account keys list \

## Deploy a model

To deploy a model, use the [az cognitiveservices account deployment create](/cli/azure/cognitiveservices/account/deployment?view=azure-cli-latest&preserve-view=true#az-cognitiveservices-account-deployment-create) command. In the following example, you deploy an instance of the `text-embedding-ada-002` model and give it the name _MyModel_. When you try the example, update the code to use your values for the resource group and resource. You don't need to change the `model-version`, `model-format` or `sku-capacity`, and `sku-name` values.

```azurecli
az cognitiveservices account deployment create \
@@ -94,6 +94,9 @@ az cognitiveservices account deployment create \
--sku-name "Standard"
```

`--sku-name` accepts the following deployment types: `Standard`, `GlobalStandard`, and `ProvisionedManaged`. Learn more about [deployment type options](../how-to/deployment-types.md).

> [!IMPORTANT]
> When you access the model via the API, you need to refer to the deployment name rather than the underlying model name in API calls, which is one of the [key differences](../how-to/switching-endpoints.yml) between OpenAI and Azure OpenAI. OpenAI only requires the model name. Azure OpenAI always requires the deployment name, even when using the model parameter. In our docs, we often have examples where deployment names are represented as identical to model names to help indicate which model works with a particular API endpoint. Ultimately your deployment names can follow whatever naming convention is best for your use case.
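The note above can be made concrete: in the Azure OpenAI REST API, the deployment name appears in the request path. A minimal sketch of building such a URL — the endpoint, deployment name, and API version here are placeholders:

```python
def embeddings_url(endpoint: str, deployment: str, api_version: str) -> str:
    """Build an Azure OpenAI embeddings request URL. The path segment is
    the deployment name (for example 'MyModel'), not the model name."""
    return (f"{endpoint}/openai/deployments/{deployment}"
            f"/embeddings?api-version={api_version}")

print(embeddings_url("https://contoso.openai.azure.com", "MyModel", "2024-02-01"))
# https://contoso.openai.azure.com/openai/deployments/MyModel/embeddings?api-version=2024-02-01
```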

articles/ai-services/openai/includes/create-resource-portal.md

Lines changed: 2 additions & 1 deletion
@@ -6,7 +6,7 @@ description: Learn how to use the Azure portal to create an Azure OpenAI resourc
manager: nitinme
ms.service: azure-ai-openai
ms.topic: include
- ms.date: 01/30/2024
+ ms.date: 05/20/2024
---

## Prerequisites

@@ -111,6 +111,7 @@ To deploy a model, follow these steps:
|---|---|
| **Select a model** | Model availability varies by region. For a list of available models per region, see [Model summary table and region availability](../concepts/models.md#model-summary-table-and-region-availability). |
| **Deployment name** | Choose a name carefully. The deployment name is used in your code to call the model by using the client libraries and the REST APIs. |
| **Deployment type** | **Standard**, **Global-Standard**, or **Provisioned-Managed**. Learn more about [deployment type options](../how-to/deployment-types.md). |
| **Advanced options** (Optional) | You can set optional advanced settings, as needed for your resource. <br> - For the **Content Filter**, assign a content filter to your deployment.<br> - For the **Tokens per Minute Rate Limit**, adjust the Tokens per Minute (TPM) to set the effective rate limit for your deployment. You can modify this value at any time by using the [**Quotas**](../how-to/quota.md) menu. [**Dynamic Quota**](../how-to/dynamic-quota.md) allows you to take advantage of more quota when extra capacity is available. |

5. Select a model from the dropdown list.

articles/ai-services/openai/includes/create-resource-powershell.md

Lines changed: 3 additions & 1 deletion
@@ -7,7 +7,7 @@ manager: nitinme
ms.service: azure-ai-openai
ms.custom: devx-track-azurepowershell
ms.topic: include
- ms.date: 08/28/2023
+ ms.date: 05/20/2024
---

## Prerequisites

@@ -89,6 +89,8 @@ $sku = New-Object -TypeName "Microsoft.Azure.Management.CognitiveServices.Models
New-AzCognitiveServicesAccountDeployment -ResourceGroupName OAIResourceGroup -AccountName MyOpenAIResource -Name MyModel -Properties $properties -Sku $sku
```

The `Name` property of the `$sku` variable accepts the following deployment types: `Standard`, `GlobalStandard`, and `ProvisionedManaged`. Learn more about [deployment type options](../how-to/deployment-types.md).

> [!IMPORTANT]
> When you access the model via the API, you need to refer to the deployment name rather than the underlying model name in API calls, which is one of the [key differences](../how-to/switching-endpoints.yml) between OpenAI and Azure OpenAI. OpenAI only requires the model name. Azure OpenAI always requires the deployment name, even when using the model parameter. In our docs, we often have examples where deployment names are represented as identical to model names to help indicate which model works with a particular API endpoint. Ultimately your deployment names can follow whatever naming convention is best for your use case.
Binary file added (image, 38.1 KB)

articles/ai-services/openai/quotas-limits.md

Lines changed: 26 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ ms.custom:
1010
- ignite-2023
1111
- references_regions
1212
ms.topic: conceptual
13-
ms.date: 02/27/2024
13+
ms.date: 05/19/2024
1414
ms.author: mbullwin
1515
---
1616

@@ -50,6 +50,31 @@ The following sections provide you with a quick guide to the default quotas and
5050

5151
[!INCLUDE [Quota](includes/model-matrix/quota.md)]
5252

## gpt-4o rate limits

`gpt-4o` introduces rate limit tiers with higher limits for certain customer types.

### gpt-4o global standard

> [!NOTE]
> The [global standard model deployment type](./how-to/deployment-types.md#deployment-types) is currently in public preview.

| Tier | Quota limit in tokens per minute (TPM) | Requests per minute |
|---|:---:|:---:|
| Enterprise agreement | 10 M | 60 K |
| Default | 450 K | 2.7 K |

M = million | K = thousand

### gpt-4o standard

| Tier | Quota limit in tokens per minute (TPM) | Requests per minute |
|---|:---:|:---:|
| Enterprise agreement | 1 M | 6 K |
| Default | 150 K | 900 |

M = million | K = thousand
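The tables above can be applied mechanically: given an expected request rate and average tokens per request, check whether a tier's TPM and RPM limits are both respected. The helper is illustrative; the limits in the example calls are the default gpt-4o standard tier from the table.

```python
def within_limits(requests_per_min: int, avg_tokens_per_request: int,
                  tpm_limit: int, rpm_limit: int) -> bool:
    """Check a workload against a tier's TPM and RPM quota limits."""
    tokens_per_min = requests_per_min * avg_tokens_per_request
    return tokens_per_min <= tpm_limit and requests_per_min <= rpm_limit

# Default gpt-4o standard tier: 150 K TPM, 900 RPM.
print(within_limits(600, 200, 150_000, 900))  # True  (120 K TPM, 600 RPM)
print(within_limits(600, 300, 150_000, 900))  # False (180 K TPM > 150 K)
```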

### General best practices to remain within rate limits

To minimize issues related to rate limits, it's a good idea to use the following techniques:
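One widely used technique for staying within rate limits, retrying with exponential backoff after a 429 response, can be sketched as follows (an illustrative sketch, not part of any SDK; `RateLimitError` stands in for an HTTP 429):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response from the service."""

def call_with_backoff(request, max_retries=5, sleep=time.sleep):
    """Retry `request` on RateLimitError with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return request()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Wait 2^attempt seconds plus jitter, capped at 60 seconds.
            sleep(min(2 ** attempt + random.random(), 60))

# Demo with a fake request that succeeds on the third try.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"

print(call_with_backoff(flaky, sleep=lambda s: None))  # ok
```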

articles/ai-services/openai/toc.yml

Lines changed: 3 additions & 0 deletions
@@ -8,6 +8,9 @@ items:
    href: overview.md
  - name: Quotas and limits
    href: quotas-limits.md
  - name: Deployment types
    href: ./how-to/deployment-types.md
    displayName: global, Global, globalstandard, global-standard, Global-Standard, standard, provisioned
  - name: Models
    href: ./concepts/models.md
  - name: Model retirements
