
Commit b4342eb

Merge pull request #299427 from MicrosoftDocs/main
5/6/2025 PM Publish
2 parents 5a08043 + 8f7d579 commit b4342eb

31 files changed: +209 -133 lines changed

articles/api-management/azure-openai-api-from-specification.md

Lines changed: 3 additions & 2 deletions
@@ -5,7 +5,7 @@ ms.service: azure-api-management
 author: dlepow
 ms.author: danlep
 ms.topic: how-to
-ms.date: 04/01/2025
+ms.date: 04/30/2025
 ms.collection: ce-skilling-ai-copilot
 ms.custom: template-how-to, build-2024
 ---
@@ -65,11 +65,12 @@ To import an Azure OpenAI API to API Management:

 For example, if your API Management gateway endpoint is `https://contoso.azure-api.net`, set a **Base URL** similar to `https://contoso.azure-api.net/my-openai-api/openai`.
 1. Optionally select one or more products to associate with the API. Select **Next**.
-1. On the **Policies** tab, optionally enable policies to monitor and manage Azure OpenAI API token consumption. You can also set or edit policies later.
+1. On the **Policies** tab, optionally enable policies to help monitor and manage the API. You can also set or edit policies later.

 If selected, enter settings or accept defaults that define the following policies (see linked articles for prerequisites and configuration details):
 * [Manage token consumption](azure-openai-token-limit-policy.md)
 * [Track token usage](azure-openai-emit-token-metric-policy.md)
+* [Enable semantic caching of responses](azure-openai-enable-semantic-caching.md)

 Select **Review + Create**.
 1. After settings are validated, select **Create**.

articles/api-management/azure-openai-enable-semantic-caching.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ ms.collection: ce-skilling-ai-copilot
 Enable semantic caching of responses to Azure OpenAI API requests to reduce bandwidth and processing requirements imposed on the backend APIs and lower latency perceived by API consumers. With semantic caching, you can return cached responses for identical prompts and also for prompts that are similar in meaning, even if the text isn't the same. For background, see [Tutorial: Use Azure Cache for Redis as a semantic cache](../redis/tutorial-semantic-cache.md).

 > [!NOTE]
-> The configuration steps in this article enable semantic caching for Azure OpenAI APIs. These steps can be generalized to enable semantic caching for corresponding large language model (LLM) APIs available through the [Azure AI Model Inference API](/azure/ai-studio/reference/reference-model-inference-api).
+> The configuration steps in this article enable semantic caching for Azure OpenAI APIs. These steps can be generalized to enable semantic caching for corresponding large language model (LLM) APIs available through the [Azure AI Model Inference API](/azure/ai-studio/reference/reference-model-inference-api) or with OpenAI-compatible models served through third-party inference providers.

 ## Prerequisites

articles/api-management/azure-openai-semantic-cache-lookup-policy.md

Lines changed: 6 additions & 5 deletions
@@ -9,7 +9,7 @@ ms.collection: ce-skilling-ai-copilot
 ms.custom:
 - build-2024
 ms.topic: reference
-ms.date: 12/13/2024
+ms.date: 04/29/2025
 ms.author: danlep
 ---

@@ -22,7 +22,7 @@ Use the `azure-openai-semantic-cache-lookup` policy to perform cache lookup of r
 > [!NOTE]
 > * This policy must have a corresponding [Cache responses to Azure OpenAI API requests](azure-openai-semantic-cache-store-policy.md) policy.
 > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
-> * Currently, this policy is in preview.

 [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

@@ -45,10 +45,10 @@ Use the `azure-openai-semantic-cache-lookup` policy to perform cache lookup of r

 | Attribute | Description | Required | Default |
 | ----------------- | ------------------------------------------------------ | -------- | ------- |
-| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
+| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. Smaller values represent greater semantic similarity. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
 | embeddings-backend-id | [Backend](backends.md) ID for OpenAI embeddings API call. | Yes | N/A |
 | embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to `system-assigned`. | N/A |
-| ignore-system-messages | Boolean. If set to `true`, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
+| ignore-system-messages | Boolean. When set to `true` (recommended), removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
 | max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |

 ## Elements
@@ -67,7 +67,8 @@ Use the `azure-openai-semantic-cache-lookup` policy to perform cache lookup of r
 ### Usage notes

 - This policy can only be used once in a policy section.
+- Fine-tune the value of `score-threshold` based on your application to ensure that the right sensitivity is used when determining which queries to cache. Start with a low value such as 0.05 and adjust to optimize the ratio of cache hits to misses.
+- The embeddings model should have enough capacity to handle the prompt volume and a context size large enough to accommodate the prompts.

 ## Examples
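
For orientation, a minimal lookup/store pair that reflects the attributes described above might look like the following sketch (the backend ID, score threshold, and cache duration are illustrative placeholders, not values taken from the article):

```xml
<policies>
    <inbound>
        <base />
        <!-- Return a cached completion when a semantically similar prompt is found -->
        <azure-openai-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned"
            ignore-system-messages="true" />
    </inbound>
    <outbound>
        <!-- Cache the returned completion for 60 seconds for reuse by similar prompts -->
        <azure-openai-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
```

The `duration` value on the store policy controls how long cached completions are kept, in seconds.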

articles/api-management/genai-gateway-capabilities.md

Lines changed: 13 additions & 7 deletions
@@ -7,7 +7,7 @@ author: dlepow
 ms.service: azure-api-management
 ms.collection: ce-skilling-ai-copilot
 ms.topic: concept-article
-ms.date: 02/05/2025
+ms.date: 04/29/2025
 ms.author: danlep
 ---

@@ -18,7 +18,7 @@ ms.author: danlep
 This article introduces capabilities in Azure API Management to help you manage generative AI APIs, such as those provided by [Azure OpenAI Service](/azure/ai-services/openai/overview). Azure API Management provides a range of policies, metrics, and other features to enhance security, performance, and reliability for the APIs serving your intelligent apps. Collectively, these features are called *AI gateway capabilities* for your generative AI APIs.

 > [!NOTE]
-> * This article focuses on capabilities to manage APIs exposed by Azure OpenAI Service. Many of the AI gateway capabilities apply to other large language model (LLM) APIs, including those available through [Azure AI Model Inference API](/azure/ai-studio/reference/reference-model-inference-api).
+> * Use AI gateway capabilities to manage APIs exposed by Azure OpenAI Service, available through [Azure AI Model Inference API](/azure/ai-studio/reference/reference-model-inference-api), or with OpenAI-compatible models served through third-party inference providers.
 > * AI gateway capabilities are features of API Management's existing API gateway, not a separate API gateway. For more information on API Management, see [Azure API Management overview](api-management-key-concepts.md).

 ## Challenges in managing generative AI APIs

@@ -36,7 +36,7 @@ The rest of this article describes how Azure API Management can help you address

 ## Import Azure OpenAI Service resource as an API

-[Import an API from an Azure OpenAI Service endpoint](azure-openai-api-from-specification.md) to Azure API management using a single-click experience. API Management streamlines the onboarding process by automatically importing the OpenAPI schema for the Azure OpenAI API and sets up authentication to the Azure OpenAI endpoint using managed identity, removing the need for manual configuration. Within the same user-friendly experience, you can preconfigure policies for [token limits](#token-limit-policy) and [emitting token metrics](#emit-token-metric-policy).
+[Import an API from an Azure OpenAI Service endpoint](azure-openai-api-from-specification.md) to Azure API management using a single-click experience. API Management streamlines the onboarding process by automatically importing the OpenAPI schema for the Azure OpenAI API and sets up authentication to the Azure OpenAI endpoint using managed identity, removing the need for manual configuration. Within the same user-friendly experience, you can preconfigure policies for [token limits](#token-limit-policy), [emitting token metrics](#emit-token-metric-policy), and [semantic caching](#semantic-caching-policy).

 :::image type="content" source="media/azure-openai-api-from-specification/azure-openai-api.png" alt-text="Screenshot of Azure OpenAI API tile in the portal.":::

@@ -57,7 +57,7 @@ The following basic example demonstrates how to set a TPM limit of 500 per subscription key:
 ```

 > [!TIP]
-> To manage and enforce token limits for LLM APIs available through the Azure AI Model Inference API, API Management provides the equivalent [llm-token-limit](llm-token-limit-policy.md) policy.
+> To manage and enforce token limits for other LLM APIs, API Management provides the equivalent [llm-token-limit](llm-token-limit-policy.md) policy.
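
A policy along these lines would set a limit of 500 tokens per minute keyed to the subscription. This is a sketch only, since the article's full snippet isn't reproduced in this hunk, and the `estimate-prompt-tokens` choice is illustrative:

```xml
<!-- Placed in the inbound policy section -->
<azure-openai-token-limit
    counter-key="@(context.Subscription.Id)"
    tokens-per-minute="500"
    estimate-prompt-tokens="true" />
```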

 ## Emit token metric policy
@@ -79,7 +79,7 @@ For example, the following policy sends metrics to Application Insights split by
 ```

 > [!TIP]
-> To send metrics for LLM APIs available through the Azure AI Model Inference API, API Management provides the equivalent [llm-emit-token-metric](llm-emit-token-metric-policy.md) policy.
+> To send metrics for other LLM APIs, API Management provides the equivalent [llm-emit-token-metric](llm-emit-token-metric-policy.md) policy.
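
The policy referenced above is elided by the diff context; a hedged sketch of an `azure-openai-emit-token-metric` policy that splits the token metrics across a few common dimensions (the namespace and the dimension choices are illustrative) could look like this:

```xml
<azure-openai-emit-token-metric namespace="openai">
    <!-- Each dimension becomes a custom metric dimension in Application Insights -->
    <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
    <dimension name="Client IP" value="@(context.Request.IpAddress)" />
    <dimension name="API ID" value="@(context.Api.Id)" />
</azure-openai-emit-token-metric>
```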

 ## Backend load balancer and circuit breaker

@@ -99,12 +99,18 @@ Configure [Azure OpenAI semantic caching](azure-openai-enable-semantic-caching.m

 :::image type="content" source="media/genai-gateway-capabilities/semantic-caching.png" alt-text="Diagram of semantic caching in API Management.":::

-In API Management, enable semantic caching by using Azure Redis Enterprise or another [external cache](api-management-howto-cache-external.md) compatible with RediSearch and onboarded to Azure API Management. By using the Azure OpenAI Service Embeddings API, the [azure-openai-semantic-cache-store](azure-openai-semantic-cache-store-policy.md) and [azure-openai-semantic-cache-lookup](azure-openai-semantic-cache-lookup-policy.md) policies store and retrieve semantically similar prompt completions from the cache. This approach ensures completions reuse, resulting in reduced token consumption and improved response performance.
+In API Management, enable semantic caching by using Azure Redis Enterprise, Azure Managed Redis, or another [external cache](api-management-howto-cache-external.md) compatible with RediSearch and onboarded to Azure API Management. By using the Azure OpenAI Service Embeddings API, the [azure-openai-semantic-cache-store](azure-openai-semantic-cache-store-policy.md) and [azure-openai-semantic-cache-lookup](azure-openai-semantic-cache-lookup-policy.md) policies store and retrieve semantically similar prompt completions from the cache. This approach ensures completions reuse, resulting in reduced token consumption and improved response performance.

 > [!TIP]
-> To enable semantic caching for LLM APIs available through the Azure AI Model Inference API, API Management provides the equivalent [llm-semantic-cache-store-policy](llm-semantic-cache-store-policy.md) and [llm-semantic-cache-lookup-policy](llm-semantic-cache-lookup-policy.md) policies.
+> To enable semantic caching for other LLM APIs, API Management provides the equivalent [llm-semantic-cache-store-policy](llm-semantic-cache-store-policy.md) and [llm-semantic-cache-lookup-policy](llm-semantic-cache-lookup-policy.md) policies.

+## Content safety policy
+
+To help safeguard users from harmful, offensive, or misleading content, you can automatically moderate all incoming requests to an LLM API by configuring the [llm-content-safety](llm-content-safety-policy.md) policy. The policy enforces content safety checks on LLM prompts by transmitting them first to the [Azure AI Content Safety](/azure/ai-services/content-safety/overview) service before sending to the backend LLM API.
+
+:::image type="content" source="media/genai-gateway-capabilities/content-safety.png" alt-text="Diagram of moderating prompts by Azure AI Content Safety in an API Management policy.":::
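
As a rough illustration of the new policy, a configuration like the following sketch could be used (the backend ID assumes a [backend](backends.md) that points to an Azure AI Content Safety resource, and the category thresholds are placeholder values):

```xml
<llm-content-safety backend-id="content-safety-backend" shield-prompt="true">
    <categories output-type="EightSeverityLevels">
        <!-- Block prompts whose detected severity meets or exceeds the threshold -->
        <category name="Hate" threshold="4" />
        <category name="Violence" threshold="4" />
    </categories>
</llm-content-safety>
```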

 ## Labs and samples

 * [Labs for the AI gateway capabilities of Azure API Management](https://github.com/Azure-Samples/ai-gateway)

articles/api-management/llm-emit-token-metric-policy.md

Lines changed: 0 additions & 3 deletions
@@ -18,9 +18,6 @@ ms.custom:

 The `llm-emit-token-metric` policy sends custom metrics to Application Insights about consumption of large language model (LLM) tokens through LLM APIs. Token count metrics include: Total Tokens, Prompt Tokens, and Completion Tokens.

-> [!NOTE]
-> Currently, this policy is in preview.
-
 [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

 [!INCLUDE [api-management-llm-models](../../includes/api-management-llm-models.md)]

articles/api-management/llm-semantic-cache-lookup-policy.md

Lines changed: 5 additions & 4 deletions
@@ -9,7 +9,7 @@ ms.collection: ce-skilling-ai-copilot
 ms.custom:
 - build-2024
 ms.topic: reference
-ms.date: 12/13/2024
+ms.date: 04/29/2025
 ms.author: danlep
 ---

@@ -22,7 +22,6 @@ Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses
 > [!NOTE]
 > * This policy must have a corresponding [Cache responses to large language model API requests](llm-semantic-cache-store-policy.md) policy.
 > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
-> * Currently, this policy is in preview.

 [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

@@ -45,10 +44,10 @@ Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses

 | Attribute | Description | Required | Default |
 | ----------------- | ------------------------------------------------------ | -------- | ------- |
-| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
+| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. Smaller values represent greater semantic similarity. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
 | embeddings-backend-id | [Backend](backends.md) ID for OpenAI embeddings API call. | Yes | N/A |
 | embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to `system-assigned`. | N/A |
-| ignore-system-messages | Boolean. If set to `true`, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
+| ignore-system-messages | Boolean. When set to `true` (recommended), removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
 | max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |

 ## Elements
@@ -67,6 +66,8 @@ Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses
 ### Usage notes

 - This policy can only be used once in a policy section.
+- Fine-tune the value of `score-threshold` based on your application to ensure that the right sensitivity is used when determining which queries to cache. Start with a low value such as 0.05 and adjust to optimize the ratio of cache hits to misses.
+- The embeddings model should have enough capacity to handle the prompt volume and a context size large enough to accommodate the prompts.

 ## Examples
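
For reference, a sketch of the llm-flavored lookup policy using the low starting threshold suggested above (the backend ID is a placeholder):

```xml
<llm-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true" />
```

A matching `llm-semantic-cache-store` policy in the outbound section completes the pair.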

articles/api-management/llm-semantic-cache-store-policy.md

Lines changed: 0 additions & 1 deletion
@@ -21,7 +21,6 @@ The `llm-semantic-cache-store` policy caches responses to chat completion API re
 > [!NOTE]
 > * This policy must have a corresponding [Get cached responses to large language model API requests](llm-semantic-cache-lookup-policy.md) policy.
 > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
-> * Currently, this policy is in preview.

 [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
articles/api-management/llm-token-limit-policy.md

Lines changed: 0 additions & 3 deletions
@@ -20,9 +20,6 @@ The `llm-token-limit` policy prevents large language model (LLM) API usage spike

 By relying on token usage metrics returned from the LLM endpoint, the policy can accurately monitor and enforce limits in real time. The policy also enables precalculation of prompt tokens by API Management, minimizing unnecessary requests to the LLM backend if the limit is already exceeded.
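
A minimal sketch of how that precalculation is switched on (the counter key and limit are illustrative values, not taken from the article):

```xml
<llm-token-limit
    counter-key="@(context.Subscription.Id)"
    tokens-per-minute="1000"
    estimate-prompt-tokens="true" />
```

With `estimate-prompt-tokens="true"`, API Management estimates the prompt's token count before calling the backend and can reject a request immediately if the limit is already exceeded.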

-> [!NOTE]
-> Currently, this policy is in preview.
-
 [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

 [!INCLUDE [api-management-llm-models](../../includes/api-management-llm-models.md)]
