Commit d84fb33

Merge pull request #1084 from MicrosoftDocs/main
10/28/2024 PM Publish
2 parents 1dfe639 + df5a0f5 commit d84fb33

File tree

67 files changed: +1302 -243 lines


articles/ai-services/translator/text-sdk-overview.md

Lines changed: 2 additions & 2 deletions
@@ -236,8 +236,8 @@ const {TextTranslationClient } = require("@azure-rest/ai-translation-text").defa
 ### [Python](#tab/python)
 
 ```python
-from azure.core.credentials import TextTranslationClient
-from azure-ai-translation-text import TextTranslationClient
+from azure.core.credentials import AzureKeyCredential
+from azure.ai.translation.text import TextTranslationClient
 ```
 
 ---
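For context, the corrected imports come from the `azure-ai-translation-text` package. The following is a minimal sketch of how they are used, not part of this commit; the key, region, and method parameter names are placeholders based on the 1.x SDK and should be checked against the reference for your installed version.

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.translation.text import TextTranslationClient

# Placeholder key and region; constructor and method parameter names follow
# the 1.x SDK and may differ in other versions.
credential = AzureKeyCredential("<your-translator-key>")
client = TextTranslationClient(credential=credential, region="<your-region>")

# Translate one string to Spanish and print the result.
result = client.translate(body=["Hello, world!"], to_language=["es"])
print(result[0].translations[0].text)
```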
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
---
title: Understanding deployment types in Azure AI model inference
titleSuffix: Azure AI services
description: Learn how to use deployment types in Azure AI model deployments
author: sdgilley
manager: scottpolly
ms.service: azure-ai-studio
ms.topic: conceptual
ms.date: 10/24/2024
ms.author: fasantia
ms.reviewer: fasantia
ms.custom: github-universe-2024
---

# Deployment types in Azure AI model inference

Azure AI model inference in Azure AI services gives customers choices on the hosting structure that fits their business and usage patterns. The service offers two main types of deployment: **standard** and **provisioned**. Standard is offered with a global deployment option, routing traffic globally to provide higher throughput. Provisioned is also offered with a global deployment option, allowing customers to purchase and deploy provisioned throughput units across Azure's global infrastructure.

All deployments can perform the exact same inference operations; however, the billing, scale, and performance are substantially different. As part of your solution design, you need to make two key decisions:

- **Data residency needs**: global versus regional resources
- **Call volume**: standard versus provisioned

Support for deployment types varies by model and model provider.

## Global versus regional deployment types

For standard and provisioned deployments, you have a choice of two configurations within your resource: **global** or **regional**. Global standard is the recommended starting point.

Global deployments use Azure's global infrastructure to dynamically route customer traffic to the data center with the best availability for each inference request. With global deployments you get the highest initial throughput limits and the best model availability, while still being covered by the uptime SLA and getting low latency. For high-volume workloads above the specified usage tiers on standard and global standard, you might experience increased latency variation. For customers that require lower latency variance at large workload usage, we recommend purchasing provisioned throughput.

Our global deployments are the first location for all new models and features. Customers with very large throughput requirements should consider our provisioned deployment offering.

## Standard

Standard deployments provide a pay-per-call billing model on the chosen model. They offer the fastest way to get started because you only pay for what you consume. Model availability in each region, as well as throughput, might be limited.

Standard deployments are optimized for low-to-medium volume workloads with high burstiness. Customers with high, consistent volume might experience greater latency variability.

Only Azure OpenAI models support this deployment type.

## Global standard

Global deployments are available in the same Azure AI services resources as nonglobal deployment types, but they let you use Azure's global infrastructure to dynamically route traffic to the data center with the best availability for each request. Global standard provides the highest default quota and eliminates the need to load balance across multiple resources.

Customers with high, consistent volume might experience greater latency variability. The threshold is set per model. For applications that require lower latency variance at large workload usage, we recommend purchasing provisioned throughput if available.

## Global provisioned

Global deployments are available in the same Azure AI services resources as nonglobal deployment types, but they let you use Azure's global infrastructure to dynamically route traffic to the data center with the best availability for each request. Global provisioned deployments provide reserved model processing capacity for high and predictable throughput using Azure's global infrastructure.

Only Azure OpenAI models support this deployment type.
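As a rough illustration of how these deployment types surface when a deployment is created programmatically, the sketch below uses the `azure-mgmt-cognitiveservices` management SDK with a `GlobalStandard` SKU name; the SDK choice, SKU name, model name, version, and capacity are assumptions for illustration and aren't taken from this article.

```python
# Hypothetical sketch: creating a global standard deployment with the Azure
# management SDK (azure-mgmt-cognitiveservices). Names and values below are
# illustrative placeholders, not documented guidance.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment, DeploymentModel, DeploymentProperties, Sku
)

client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

poller = client.deployments.begin_create_or_update(
    resource_group_name="<resource-group>",
    account_name="<ai-services-resource>",
    deployment_name="gpt-4o-mini",
    deployment=Deployment(
        # The deployment type is expressed through the SKU (assumed name).
        sku=Sku(name="GlobalStandard", capacity=100),
        properties=DeploymentProperties(
            model=DeploymentModel(format="OpenAI", name="gpt-4o-mini", version="2024-07-18"),
        ),
    ),
)
print(poller.result().name)
```

Under the same assumptions, targeting a different deployment type would mainly be a matter of choosing a different SKU name, for example a standard or provisioned variant.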
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
---
title: Use the Azure AI model inference endpoint
titleSuffix: Azure AI studio
description: Learn how to use the Azure AI model inference endpoint and how to configure it.
ms.service: azure-ai-studio
ms.topic: conceptual
author: sdgilley
manager: scottpolly
ms.date: 10/24/2024
ms.author: sgilley
ms.reviewer: fasantia
ms.custom: github-universe-2024
---

# Use the Azure AI model inference endpoint

The Azure AI inference service in Azure AI services allows customers to consume the most powerful models from flagship model providers using a single endpoint and set of credentials. This means that you can switch between models and consume them from your application without changing a single line of code.

This article explains how models are organized inside of the service and how to use the inference endpoint to invoke them.

## Deployments

The Azure AI model inference service makes models available using the **deployment** concept. A **deployment** gives a model a name under a certain configuration. You can then invoke that model configuration by indicating its name in your requests.

Deployments capture:

> [!div class="checklist"]
> * A model name
> * A model version
> * A provisioning/capacity type<sup>1</sup>
> * A content filtering configuration<sup>1</sup>
> * A rate limiting configuration<sup>1</sup>

<sup>1</sup> Configurations can vary depending on the selected model.

An Azure AI services resource can have as many model deployments as needed, and they don't incur cost unless inference is performed for those models. Deployments are Azure resources, and hence they're subject to Azure policies.

To learn more about how to create deployments, see [Add and configure model deployments](../how-to/create-model-deployments.md).
## Azure AI inference endpoint
41+
42+
The Azure AI inference endpoint allows customers to use a single endpoint with the same authentication and schema to generate inference for the deployed models in the resource. This endpoint follows the [Azure AI model inference API](../../reference/reference-model-inference-api.md) which is supported by all the models in Azure AI model inference service.
43+
44+
You can see the endpoint URL and credentials in the **Overview** section. The endpoint usually has the form `https://<resource-name>.services.ai.azure.com/models`:
45+
46+
:::image type="content" source="../../media/ai-services/overview/overview-endpoint-and-key.png" alt-text="A screenshot showing how to get the URL and key associated with the resource." lightbox="../../media/ai-services/overview/overview-endpoint-and-key.png":::
47+
48+
### Routing
49+
50+
The inference endpoint routes requests to a given deployment by matching the parameter `name` inside of the request to the name of the deployment. This means that *deployments work as an alias of a given model under certain configurations*. This flexibility allows you to deploy a given model multiple times in the service but under different configurations if needed.
51+
52+
:::image type="content" source="../../media/ai-services/endpoint/endpoint-routing.png" alt-text="An illustration showing how routing works for a Meta-llama-3.2-8b-instruct model by indicating such name in the parameter 'model' inside of the payload request." lightbox="../../media/ai-services/endpoint/endpoint-routing.png":::
53+
54+
For example, if you create a deployment named `Mistral-large`, then such deployment can be invoked as:
55+
56+
[!INCLUDE [code-create-chat-completion](../../includes/ai-services/code-create-chat-completion.md)]
57+
58+
> [!TIP]
59+
> Deployment routing is not case sensitive.
60+
61+
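To make the routing behavior concrete, here's a minimal sketch with the `azure-ai-inference` Python package, assuming key-based authentication; it isn't the include referenced above, and the endpoint, key, and prompt are placeholders.

```python
import os

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# The resource's single Azure AI inference endpoint (placeholder values).
client = ChatCompletionsClient(
    endpoint="https://<resource-name>.services.ai.azure.com/models",
    credential=AzureKeyCredential(os.environ["AZURE_AI_SERVICES_KEY"]),
)

# The `model` value is matched against the deployment name, so this request
# is routed to the deployment called `Mistral-large`.
response = client.complete(
    model="Mistral-large",
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Explain what deployment routing means."),
    ],
)
print(response.choices[0].message.content)
```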
### Supported languages and SDKs

All models deployed in the Azure AI model inference service support the [Azure AI model inference API](https://aka.ms/aistudio/modelinference) and its associated family of SDKs, which are available in the following languages:

| Language   | Documentation | Package | Examples |
|------------|---------------|---------|----------|
| C#         | [Reference](https://aka.ms/azsdk/azure-ai-inference/csharp/reference) | [azure-ai-inference (NuGet)](https://www.nuget.org/packages/Azure.AI.Inference/) | [C# examples](https://aka.ms/azsdk/azure-ai-inference/csharp/samples) |
| Java       | [Reference](https://aka.ms/azsdk/azure-ai-inference/java/reference) | [azure-ai-inference (Maven)](https://central.sonatype.com/artifact/com.azure/azure-ai-inference/) | [Java examples](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/ai/azure-ai-inference/src/samples) |
| JavaScript | [Reference](https://aka.ms/AAp1kxa) | [@azure/ai-inference (npm)](https://www.npmjs.com/package/@azure/ai-inference) | [JavaScript examples](https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/ai/ai-inference-rest/samples) |
| Python     | [Reference](https://aka.ms/azsdk/azure-ai-inference/python/reference) | [azure-ai-inference (PyPi)](https://pypi.org/project/azure-ai-inference/) | [Python examples](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/ai/azure-ai-inference/samples) |
## Azure OpenAI inference endpoint

Azure OpenAI models also support the Azure OpenAI API. This API exposes the full capabilities of OpenAI models and supports additional features like assistants, threads, files, and batch inference.

Each Azure OpenAI model deployment has its own URL under the Azure OpenAI inference endpoint; however, the same authentication mechanism can be used to consume it. URLs are usually in the form `https://<resource-name>.openai.azure.com/openai/deployments/<model-deployment-name>`. Learn more in the reference page for the [Azure OpenAI API](../../../ai-services/openai/reference.md).

:::image type="content" source="../../media/ai-services/endpoint/endpoint-openai.png" alt-text="An illustration showing how Azure OpenAI deployments contain a single URL for each deployment." lightbox="../../media/ai-services/endpoint/endpoint-openai.png":::

Each deployment has a URL that is the concatenation of the **Azure OpenAI** base URL and the route `/deployments/<model-deployment-name>`.

> [!IMPORTANT]
> There is no routing mechanism for the Azure OpenAI endpoint, because each URL is exclusive to a single model deployment.
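For illustration, here's a minimal sketch with the OpenAI Python SDK's `AzureOpenAI` class; the endpoint, key, API version, and deployment name are placeholders rather than values from this article.

```python
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<resource-name>.openai.azure.com",
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-06-01",  # placeholder; use a version supported by your resource
)

# With the Azure OpenAI endpoint, `model` is the deployment name; the request
# goes to that deployment's own URL under /openai/deployments/<name>.
response = client.chat.completions.create(
    model="<model-deployment-name>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```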
### Supported languages and SDKs

The Azure OpenAI endpoint is supported by the **OpenAI SDK (`AzureOpenAI` class)** and **Azure OpenAI SDKs**, which are available in multiple languages:

| Language   | Source code | Package | Examples |
|------------|-------------|---------|----------|
| C#         | [Source code](https://github.com/Azure/azure-sdk-for-net/tree/main/sdk/openai/Azure.AI.OpenAI) | [Azure.AI.OpenAI (NuGet)](https://www.nuget.org/packages/Azure.AI.OpenAI/) | [C# examples](https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/openai/Azure.AI.OpenAI/tests/Samples) |
| Go         | [Source code](https://github.com/Azure/azure-sdk-for-go/tree/main/sdk/ai/azopenai) | [azopenai (Go)](https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/ai/azopenai) | [Go examples](https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/ai/azopenai#pkg-examples) |
| Java       | [Source code](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/openai/azure-ai-openai) | [azure-ai-openai (Maven)](https://central.sonatype.com/artifact/com.azure/azure-ai-openai/) | [Java examples](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/openai/azure-ai-openai/src/samples) |
| JavaScript | [Source code](https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/openai/openai) | [@azure/openai (npm)](https://www.npmjs.com/package/@azure/openai) | [JavaScript examples](https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/openai/openai/samples/) |
| Python     | [Source code](https://github.com/openai/openai-python) | [openai (PyPi)](https://pypi.org/project/openai/) | [Python examples](https://github.com/openai/openai-cookbook) |

## Next steps

- [Deployment types](deployment-types.md)
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
---
title: Azure AI model inference quotas and limits
titleSuffix: Azure AI services
description: Quick reference, detailed description, and best practices on the quotas and limits for the Azure AI models service in Azure AI services.
ms.service: azure-ai-studio
ms.custom: github-universe-2024
ms.topic: conceptual
author: sdgilley
manager: scottpolly
ms.date: 10/24/2024
ms.author: sgilley
ms.reviewer: fasantia
---

# Azure AI model inference quotas and limits

This article contains a quick reference and a detailed description of the quotas and limits for Azure AI model inference in Azure AI services. For quotas and limits specific to the Azure OpenAI Service, see [Quota and limits in the Azure OpenAI service](../../../ai-services/openai/quotas-limits.md).

## Quotas and limits reference

The following sections provide a quick guide to the default quotas and limits that apply to the Azure AI model inference service in Azure AI services:

### Resource limits

| Limit name | Limit value |
|--|--|
| Azure AI Services resources per region per Azure subscription | 30 |
| Max model deployments per resource | 32 |

### Rate limits

| Limit name | Limit value |
| ---------- | ----------- |
| Tokens per minute (Azure OpenAI models) | Varies per model and SKU. See [limits for Azure OpenAI](../../../ai-services/openai/quotas-limits.md). |
| Tokens per minute (rest of models) | 200,000 |
| Requests per minute (Azure OpenAI models) | Varies per model and SKU. See [limits for Azure OpenAI](../../../ai-services/openai/quotas-limits.md). |
| Requests per minute (rest of models) | 1,000 |

### Other limits

| Limit name | Limit value |
|--|--|
| Max number of custom headers in API requests<sup>1</sup> | 10 |

<sup>1</sup> Our current APIs allow up to 10 custom headers, which are passed through the pipeline and returned. We have noticed that some customers now exceed this header count, resulting in HTTP 431 errors. There is no solution for this error other than to reduce header volume. **In future API versions we will no longer pass through custom headers.** We recommend that customers not depend on custom headers in future system architectures.
## Usage tiers

Global Standard deployments use Azure's global infrastructure, dynamically routing customer traffic to the data center with the best availability for each inference request. This enables more consistent latency for customers with low to medium levels of traffic. Customers with high sustained levels of usage might see more variability in response latency.

The Usage Limit determines the level of usage above which customers might see larger variability in response latency. A customer's usage is defined per model and is the total tokens consumed across all deployments in all subscriptions in all regions for a given tenant.

## General best practices to remain within rate limits

To minimize issues related to rate limits, it's a good idea to use the following techniques:

- Implement retry logic in your application (see the sketch after this list).
- Avoid sharp changes in the workload. Increase the workload gradually.
- Test different load increase patterns.
- Increase the quota assigned to your deployment. Move quota from another deployment, if necessary.
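As an illustration of the first technique, here's a small, framework-agnostic retry sketch in Python; the function name, error handling, and backoff values are illustrative assumptions rather than part of the service.

```python
# Minimal retry-with-backoff sketch for rate-limited (HTTP 429) calls.
# Adapt the exception handling to the SDK you actually use.
import random
import time


def call_with_retries(send_request, max_attempts=5):
    """Call `send_request()` and retry with exponential backoff on 429 errors."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except Exception as error:  # narrow this to the SDK's rate-limit error in real code
            status = getattr(error, "status_code", None)
            if status != 429 or attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter, capped at 60 seconds.
            delay = min(2 ** attempt + random.random(), 60)
            time.sleep(delay)
```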
### Request increases to the default quotas and limits

Quota increase requests can be submitted and evaluated per request. [Submit a service request](../../../ai-services/cognitive-services-support-options.md?context=/azure/ai-studio/context/context).

## Next steps

* Learn more about the [Azure AI model inference service](../model-inference.md)
