
Commit 029703a

gitName authored and committed

rebase

1 parent d6718f4 · commit 029703a

6 files changed: +13 additions, -18 deletions

articles/api-management/azure-openai-semantic-cache-lookup-policy.md

Lines changed: 6 additions & 5 deletions
```diff
@@ -9,7 +9,7 @@ ms.collection: ce-skilling-ai-copilot
 ms.custom:
   - build-2024
 ms.topic: reference
-ms.date: 12/13/2024
+ms.date: 04/29/2025
 ms.author: danlep
 ---
 
@@ -22,7 +22,7 @@ Use the `azure-openai-semantic-cache-lookup` policy to perform cache lookup of responses
 > [!NOTE]
 > * This policy must have a corresponding [Cache responses to Azure OpenAI API requests](azure-openai-semantic-cache-store-policy.md) policy.
 > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
-> * Currently, this policy is in preview.
+
 
 [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
 
@@ -45,10 +45,10 @@ Use the `azure-openai-semantic-cache-lookup` policy to perform cache lookup of responses
 
 | Attribute | Description | Required | Default |
 | ----------------- | ------------------------------------------------------ | -------- | ------- |
-| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
+| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. Smaller values represent greater semantic similarity. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
 | embeddings-backend-id | [Backend](backends.md) ID for OpenAI embeddings API call. | Yes | N/A |
 | embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to `system-assigned`. | N/A |
-| ignore-system-messages | Boolean. If set to `true`, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
+| ignore-system-messages | Boolean. When set to `true` (recommended), removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
 | max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |
 
 ## Elements
@@ -67,7 +67,8 @@ Use the `azure-openai-semantic-cache-lookup` policy to perform cache lookup of responses
 ### Usage notes
 
 - This policy can only be used once in a policy section.
-
+- Fine-tune the value of `score-threshold` based on your application to ensure that the right sensitivity is used when determining which queries to cache. Start with a low value such as 0.05 and adjust to optimize the ratio of cache hits to misses.
+- The embeddings model should have enough capacity and sufficient context size to accommodate the prompt volume and prompts.
 
 ## Examples
 
```
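For orientation, a minimal sketch of how the attributes documented above fit together in a policy definition; the backend ID `embeddings-backend`, the threshold, and the cache duration are illustrative assumptions, not part of this commit:

```xml
<policies>
    <inbound>
        <base />
        <!-- Return a cached response when a semantically similar prompt is found.
             A low score-threshold such as 0.05 is the starting point suggested in the new usage note. -->
        <azure-openai-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned"
            ignore-system-messages="true">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </azure-openai-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- The corresponding store policy required by the NOTE above; duration is in seconds. -->
        <azure-openai-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
```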

articles/api-management/llm-emit-token-metric-policy.md

Lines changed: 0 additions & 3 deletions
```diff
@@ -18,9 +18,6 @@ ms.custom:
 
 The `llm-emit-token-metric` policy sends custom metrics to Application Insights about consumption of large language model (LLM) tokens through LLM APIs. Token count metrics include: Total Tokens, Prompt Tokens, and Completion Tokens.
 
-> [!NOTE]
-> Currently, this policy is in preview.
-
 [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
 
 [!INCLUDE [api-management-llm-models](../../includes/api-management-llm-models.md)]
```
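As a hedged illustration of the policy this file documents, a sketch of an inbound section emitting token metrics; the namespace and dimension choices are assumptions for the example, not content of this commit:

```xml
<llm-emit-token-metric namespace="llm-metrics">
    <!-- Each dimension becomes a filterable property on the token metrics in Application Insights. -->
    <dimension name="API ID" />
    <dimension name="Client IP" value="@(context.Request.IpAddress)" />
</llm-emit-token-metric>
```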

articles/api-management/llm-semantic-cache-lookup-policy.md

Lines changed: 5 additions & 4 deletions
```diff
@@ -9,7 +9,7 @@ ms.collection: ce-skilling-ai-copilot
 ms.custom:
   - build-2024
 ms.topic: reference
-ms.date: 12/13/2024
+ms.date: 04/29/2025
 ms.author: danlep
 ---
 
@@ -22,7 +22,6 @@ Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses
 > [!NOTE]
 > * This policy must have a corresponding [Cache responses to large language model API requests](llm-semantic-cache-store-policy.md) policy.
 > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
-> * Currently, this policy is in preview.
 
 [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
 
@@ -45,10 +44,10 @@ Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses
 
 | Attribute | Description | Required | Default |
 | ----------------- | ------------------------------------------------------ | -------- | ------- |
-| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
+| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. Smaller values represent greater semantic similarity. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
 | embeddings-backend-id | [Backend](backends.md) ID for OpenAI embeddings API call. | Yes | N/A |
 | embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to `system-assigned`. | N/A |
-| ignore-system-messages | Boolean. If set to `true`, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
+| ignore-system-messages | Boolean. When set to `true` (recommended), removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
 | max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |
 
 ## Elements
@@ -67,6 +66,8 @@ Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses
 ### Usage notes
 
 - This policy can only be used once in a policy section.
+- Fine-tune the value of `score-threshold` based on your application to ensure that the right sensitivity is used when determining which queries to cache. Start with a low value such as 0.05 and adjust to optimize the ratio of cache hits to misses.
+- The embeddings model should have enough capacity and sufficient context size to accommodate the prompt volume and prompts.
 
 
 ## Examples
```
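A comparable sketch for the LLM variant of the lookup policy, here partitioning the cache per API; the backend ID and attribute values are illustrative assumptions:

```xml
<llm-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned"
    max-message-count="10">
    <!-- Partition cached entries by API so lookups never cross API boundaries. -->
    <vary-by>@(context.Api.Id)</vary-by>
</llm-semantic-cache-lookup>
```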

articles/api-management/llm-semantic-cache-store-policy.md

Lines changed: 0 additions & 1 deletion
```diff
@@ -21,7 +21,6 @@ The `llm-semantic-cache-store` policy caches responses to chat completion API requests
 > [!NOTE]
 > * This policy must have a corresponding [Get cached responses to large language model API requests](llm-semantic-cache-lookup-policy.md) policy.
 > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
-> * Currently, this policy is in preview.
 
 [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
 
```
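For completeness, a minimal sketch of the outbound counterpart to the lookup policy above; the duration value is an assumption for illustration:

```xml
<outbound>
    <!-- Cache the completion returned by the backend for 120 seconds. -->
    <llm-semantic-cache-store duration="120" />
    <base />
</outbound>
```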

articles/api-management/llm-token-limit-policy.md

Lines changed: 0 additions & 3 deletions
```diff
@@ -20,9 +20,6 @@ The `llm-token-limit` policy prevents large language model (LLM) API usage spikes
 
 By relying on token usage metrics returned from the LLM endpoint, the policy can accurately monitor and enforce limits in real time. The policy also enables precalculation of prompt tokens by API Management, minimizing unnecessary requests to the LLM backend if the limit is already exceeded.
 
-> [!NOTE]
-> Currently, this policy is in preview.
-
 [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
 
 [!INCLUDE [api-management-llm-models](../../includes/api-management-llm-models.md)]
```
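A sketch of the per-subscription limit described above, with prompt-token precalculation turned on; the limit value and variable name are assumptions for illustration:

```xml
<!-- Limit each subscription to 500 tokens per minute; estimating prompt tokens up front
     lets API Management reject over-limit requests without calling the LLM backend. -->
<llm-token-limit counter-key="@(context.Subscription.Id)"
    tokens-per-minute="500"
    estimate-prompt-tokens="true"
    remaining-tokens-variable-name="remainingTokens" />
```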

includes/api-management-llm-models.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -4,10 +4,10 @@ ms.service: azure-api-management
 ms.custom:
   - build-2024
 ms.topic: include
-ms.date: 07/09/2024
+ms.date: 04/29/2025
 ms.author: danlep
 ---
 
 ## Supported models
 
-Use the policy with LLM APIs added to Azure API Management that are available through the [Azure AI Model Inference API](/azure/ai-studio/reference/reference-model-inference-api).
+Use the policy with LLM APIs added to Azure API Management that are available through the [Azure AI Model Inference API](/azure/ai-studio/reference/reference-model-inference-api) or with OpenAI-compatible models served through third-party inference providers.
```
