@@ -22,7 +22,7 @@ Use the `azure-openai-semantic-cache-lookup` policy to perform cache lookup of r
> [!NOTE]
> * This policy must have a corresponding [Cache responses to Azure OpenAI API requests](azure-openai-semantic-cache-store-policy.md) policy.
> * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
-| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
+| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. Smaller values represent greater semantic similarity. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
| embeddings-backend-id |[Backend](backends.md) ID for OpenAI embeddings API call. | Yes | N/A |
| embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to `system-assigned`. | N/A |
-| ignore-system-messages | Boolean. If set to `true`, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
+| ignore-system-messages | Boolean. When set to `true` (recommended), removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
| max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |
## Elements
@@ -67,7 +67,8 @@ Use the `azure-openai-semantic-cache-lookup` policy to perform cache lookup of r
### Usage notes
- This policy can only be used once in a policy section.
+- Fine-tune the value of `score-threshold` based on your application to ensure that the right sensitivity is used when determining which queries to cache. Start with a low value such as 0.05 and adjust to optimize the ratio of cache hits to misses.
+- The embeddings model should have enough capacity and sufficient context size to accommodate the prompt volume and prompts.
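For orientation, here's a minimal sketch of how these attributes fit together in a policy definition; the backend ID, threshold, and cache duration below are illustrative values, not part of this change:

```xml
<policies>
    <inbound>
        <base />
        <!-- Look up a cached response for a semantically similar prompt before calling the backend -->
        <azure-openai-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned"
            ignore-system-messages="true">
            <!-- Partition the cache, for example per subscription -->
            <vary-by>@(context.Subscription.Id)</vary-by>
        </azure-openai-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- The corresponding store policy is required for lookups to have anything to hit -->
        <azure-openai-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
```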
articles/api-management/llm-emit-token-metric-policy.md (0 additions & 3 deletions)
@@ -18,9 +18,6 @@ ms.custom:
The `llm-emit-token-metric` policy sends custom metrics to Application Insights about consumption of large language model (LLM) tokens through LLM APIs. Token count metrics include: Total Tokens, Prompt Tokens, and Completion Tokens.
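As a rough sketch of how the policy is configured (the namespace and dimensions here are illustrative, not taken from this change):

```xml
<llm-emit-token-metric namespace="llm-metrics">
    <!-- Each dimension is attached to the Total/Prompt/Completion token metrics sent to Application Insights -->
    <dimension name="API ID" />
    <dimension name="Client IP" value="@(context.Request.IpAddress)" />
</llm-emit-token-metric>
```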
@@ -22,7 +22,6 @@ Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses
> [!NOTE]
> * This policy must have a corresponding [Cache responses to large language model API requests](llm-semantic-cache-store-policy.md) policy.
> * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
-| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
+| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. Smaller values represent greater semantic similarity. [Learn more](../redis/tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
| embeddings-backend-id |[Backend](backends.md) ID for OpenAI embeddings API call. | Yes | N/A |
| embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to `system-assigned`. | N/A |
-| ignore-system-messages | Boolean. If set to `true`, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
+| ignore-system-messages | Boolean. When set to `true` (recommended), removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
| max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |
## Elements
@@ -67,6 +66,8 @@ Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses
### Usage notes
- This policy can only be used once in a policy section.
+- Fine-tune the value of `score-threshold` based on your application to ensure that the right sensitivity is used when determining which queries to cache. Start with a low value such as 0.05 and adjust to optimize the ratio of cache hits to misses.
+- The embeddings model should have enough capacity and sufficient context size to accommodate the prompt volume and prompts.
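The `llm-*` variant has the same shape as the Azure OpenAI policy; a minimal sketch with illustrative values:

```xml
<llm-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true">
    <vary-by>@(context.Subscription.Id)</vary-by>
</llm-semantic-cache-lookup>
```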
articles/api-management/llm-semantic-cache-store-policy.md (0 additions & 1 deletion)
@@ -21,7 +21,6 @@ The `llm-semantic-cache-store` policy caches responses to chat completion API re
> [!NOTE]
> * This policy must have a corresponding [Get cached responses to large language model API requests](llm-semantic-cache-lookup-policy.md) policy.
> * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
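For context, the store policy is the outbound counterpart of the lookup policy; a minimal sketch with an illustrative duration:

```xml
<outbound>
    <!-- Cache the chat completion so later semantically similar prompts can be served from cache -->
    <llm-semantic-cache-store duration="60" />
    <base />
</outbound>
```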
articles/api-management/llm-token-limit-policy.md (0 additions & 3 deletions)
@@ -20,9 +20,6 @@ The `llm-token-limit` policy prevents large language model (LLM) API usage spike
By relying on token usage metrics returned from the LLM endpoint, the policy can accurately monitor and enforce limits in real time. The policy also enables precalculation of prompt tokens by API Management, minimizing unnecessary requests to the LLM backend if the limit is already exceeded.
-Use the policy with LLM APIs added to Azure API Management that are available through the [Azure AI Model Inference API](/azure/ai-studio/reference/reference-model-inference-api).
+Use the policy with LLM APIs added to Azure API Management that are available through the [Azure AI Model Inference API](/azure/ai-studio/reference/reference-model-inference-api) or with OpenAI-compatible models served through third-party inference providers.
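A minimal sketch of the policy in an inbound section (the counter key and limit are illustrative; `estimate-prompt-tokens` enables the prompt-token precalculation described above):

```xml
<inbound>
    <base />
    <!-- Limit token consumption per subscription; estimate prompt tokens before calling the backend -->
    <llm-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="5000"
        estimate-prompt-tokens="true" />
</inbound>
```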