---
title: GenAI gateway capabilities in Azure API Management
description: Learn about policies and features in Azure API Management that support GenAI gateway capabilities, such as token limiting, semantic caching, and more.
services: api-management
author: dlepow

ms.service: api-management
ms.topic: concept-article
ms.date: 07/16/2024
ms.author: danlep
---

# Overview of generative AI gateway capabilities in Azure API Management

[!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

While generative AI services and their APIs provide powerful capabilities for understanding, interpreting, and generating human-like text and images, they can also impose significant management and security challenges. This article provides an introduction to how Azure API Management can help you manage generative AI APIs, such as those provided by [Azure OpenAI Service](../ai-services/openai/overview.md), and how you can use policies and other features to enhance the security, performance, and reliability of your intelligent apps. Collectively, these capabilities are referred to as the *GenAI gateway*.

## Managing tokens

One of the main resources you have in Azure OpenAI Service is tokens. Azure OpenAI assigns quota for your model deployments, expressed in tokens-per-minute (TPM), which is then distributed across your model consumers, such as different applications, developer teams, and departments within the company.

For a single application, Azure makes it easy to connect your app to Azure OpenAI Service: your intelligent application connects to Azure OpenAI directly using an API key, with a TPM limit configured at the model deployment level. However, when your application portfolio grows, you're presented with multiple apps calling one or more Azure OpenAI endpoints, deployed as pay-as-you-go or Provisioned Throughput Unit (PTU) instances. That comes with certain challenges:

* How can we track token usage across multiple applications? How can we cross-charge multiple applications/teams that use Azure OpenAI models?
* How can we make sure that a single app does not consume the whole TPM quota, leaving other apps with no option to use Azure OpenAI models?
* How can we make sure that the API key is securely distributed across multiple applications?
* How can we distribute load across multiple Azure OpenAI endpoints? How can we make sure that PTUs are used first before falling back to pay-as-you-go instances?

## Token limit policy

Configure the [Azure OpenAI token limit policy](azure-openai-token-limit-policy.md) to manage and enforce limits per API consumer based on the usage of Azure OpenAI tokens. With this policy, you can set limits expressed in tokens-per-minute (TPM).

This policy provides flexibility to assign token-based limits on any counter key, such as subscription key, IP address, or any other arbitrary key defined through a policy expression. The policy also enables pre-calculation of prompt tokens on the Azure API Management side, minimizing unnecessary requests to the Azure OpenAI backend if the prompt already exceeds the limit.

The following basic example demonstrates how to set a TPM limit of 500 per subscription key:

```xml
<azure-openai-token-limit counter-key="@(context.Subscription.Id)"
    tokens-per-minute="500" estimate-prompt-tokens="false" remaining-tokens-variable-name="remainingTokens">
</azure-openai-token-limit>
```
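
The counter key doesn't have to be the subscription ID. As a variation on the preceding example (a sketch that reuses only the attributes shown above, not an additional official sample), the following policy keys the limit to the caller's IP address and enables prompt-token estimation, so requests whose prompts would already exceed the limit are rejected at the gateway instead of reaching the backend:

```xml
<!-- Sketch: limit tokens per caller IP and estimate prompt tokens in the gateway. -->
<azure-openai-token-limit counter-key="@(context.Request.IpAddress)"
    tokens-per-minute="500" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens">
</azure-openai-token-limit>
```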

## Emit token metric policy

The [Azure OpenAI emit token metric](azure-openai-emit-token-metric-policy.md) policy sends metrics to Application Insights about consumption of large language model tokens through Azure OpenAI Service APIs. The policy helps provide an overview of the utilization of Azure OpenAI Service models across multiple applications or API consumers. This policy could be useful for chargeback scenarios, monitoring, and capacity planning.

This policy captures prompt, completion, and total token usage metrics and sends them to an Application Insights namespace of your choice. Moreover, you can configure or select from predefined dimensions to split the token usage metrics, enabling granular analysis by subscription ID, IP address, or a custom dimension of your choice.

For example, the following policy sends metrics to Application Insights split by client IP address, API, and user:

```xml
<azure-openai-emit-token-metric namespace="openai">
    <dimension name="Client IP" value="@(context.Request.IpAddress)" />
    <dimension name="API ID" value="@(context.Api.Id)" />
    <dimension name="User ID" value="@(context.Request.Headers.GetValueOrDefault("x-user-id", "N/A"))" />
</azure-openai-emit-token-metric>
```
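
The `value` attribute is needed mainly for custom dimensions. For the policy's predefined dimensions, the gateway can fill in the value from the request context, so a minimal sketch (assuming `API ID` and `Subscription ID` are among the predefined dimension names) looks like this:

```xml
<azure-openai-emit-token-metric namespace="openai">
    <!-- Assumption: for predefined dimensions, no value attribute is needed;
         the gateway computes the value from the request context. -->
    <dimension name="API ID" />
    <dimension name="Subscription ID" />
</azure-openai-emit-token-metric>
```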

## Load balancer and circuit breaker

One of the challenges when building intelligent applications is to ensure that the application is resilient to backend failures and can handle high loads. By configuring your Azure OpenAI Service endpoints using [backends](backends.md) in Azure API Management, you can balance the load across them. You can also define circuit breaker rules to stop forwarding requests to the Azure OpenAI backends if they're not responsive.

The backend [load balancer](backends.md#backends-in-api-management) supports round-robin, weighted, and priority-based load balancing, giving you the flexibility to define a load distribution strategy that meets your specific requirements. For example, define priorities within the load balancer configuration to ensure optimal utilization of specific Azure OpenAI endpoints, particularly those purchased as PTUs.
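
Policies reference backends by ID, so after you create a load-balanced pool of Azure OpenAI backends, routing traffic through it takes a short policy. The following sketch assumes a hypothetical pool backend named `openai-backend-pool` (with PTU instances at the highest priority) and pairs it with the built-in `retry` policy, so a throttled (429) response can be retried against the pool as one way of falling back to pay-as-you-go capacity:

```xml
<policies>
    <inbound>
        <base />
        <!-- "openai-backend-pool" is a hypothetical backend pool ID for this sketch.
             API Management selects a backend based on the pool's priority and weight settings. -->
        <set-backend-service backend-id="openai-backend-pool" />
    </inbound>
    <backend>
        <!-- Retry once when the selected backend is throttled, giving a
             lower-priority (for example, pay-as-you-go) backend a chance to respond. -->
        <retry count="1" interval="0" first-fast-retry="true"
               condition="@(context.Response.StatusCode == 429)">
            <forward-request buffer-request-body="true" />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```

Note that circuit breaker rules are configured on the backend resource itself (for example, through an ARM or Bicep template or the REST API), not in the policy definition.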

## Labs and samples

* [Labs for the GenAI gateway capabilities of Azure API Management](https://github.com/Azure-Samples/AI-Gateway)

## Related content

* [Blog: Introducing GenAI capabilities in Azure API Management](https://techcommunity.microsoft.com/t5/azure-integration-services-blog/introducing-genai-gateway-capabilities-in-azure-api-management/ba-p/4146525)
* [Designing and implementing a gateway solution with Azure OpenAI resources](/ai/playbook/technology-guidance/generative-ai/dev-starters/genai-gateway/)
* [Training: Fundamental AI concepts](/training/modules/get-started-ai-fundamentals/)