
Commit 4fe54f4

draft complete
1 parent 05a2e1f commit 4fe54f4

8 files changed: 27 additions & 159 deletions

articles/api-management/api-management-policies.md

Lines changed: 2 additions & 2 deletions
@@ -37,7 +37,7 @@ More information about policies:
  | [Set usage quota by key](quota-by-key-policy.md) | Allows you to enforce a renewable or lifetime call volume and/or bandwidth quota, on a per key basis. | Yes | No | No | Yes |
  | [Limit concurrency](limit-concurrency-policy.md) | Prevents enclosed policies from executing by more than the specified number of requests at a time. | Yes | Yes | Yes | Yes |
  | [Limit Azure OpenAI Service token usage](azure-openai-token-limit-policy.md) | Prevents Azure OpenAI API usage spikes by limiting language model tokens per calculated key. | Yes | Yes | No | No |
- | [Limit large language model API token usage](llm-token-limit-policy.md) | Prevents large language model API usage spikes by limiting language model tokens per calculated key. | Yes | Yes | No | No |
+ | [Limit large language model API token usage](llm-token-limit-policy.md) | Prevents large language model (LLM) API usage spikes by limiting LLM tokens per calculated key. | Yes | Yes | No | No |

  ## Authentication and authorization

@@ -134,7 +134,7 @@ More information about policies:
  | [Trace](trace-policy.md) | Adds custom traces into the [request tracing](./api-management-howto-api-inspector.md) output in the test console, Application Insights telemetries, and resource logs. | Yes | Yes<sup>1</sup> | Yes | Yes |
  | [Emit metrics](emit-metric-policy.md) | Sends custom metrics to Application Insights at execution. | Yes | Yes | Yes | Yes |
  | [Emit Azure OpenAI token metrics](azure-openai-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of language model tokens through Azure OpenAI service APIs. | Yes | Yes | No | No |
- | [Emit large language model API token metrics](llm-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of language model tokens through large language model APIs. | Yes | Yes | No | No |
+ | [Emit large language model API token metrics](llm-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of large language model (LLM) tokens through LLM APIs. | Yes | Yes | No | No |


  <sup>1</sup> In the V2 gateway, the `trace` policy currently does not add tracing output in the test console.
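For context, a minimal sketch of how the Azure OpenAI token policies listed in these tables might be combined in an API's inbound section. The element and attribute names follow the linked policy references; the limit value, namespace, and dimension are illustrative placeholders.

```xml
<inbound>
    <base />
    <!-- Cap prompt + completion tokens per subscription (illustrative limit) -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="5000"
        estimate-prompt-tokens="false"
        remaining-tokens-header-name="remaining-tokens" />
    <!-- Emit token consumption metrics to Application Insights -->
    <azure-openai-emit-token-metric namespace="openai-metrics">
        <dimension name="Subscription ID" />
    </azure-openai-emit-token-metric>
</inbound>
```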

articles/api-management/azure-openai-enable-semantic-caching.md

Lines changed: 12 additions & 3 deletions
@@ -17,6 +17,9 @@ ms.collection: ce-skilling-ai-copilot

  Enable semantic caching of responses to Azure OpenAI API requests to reduce bandwidth and processing requirements imposed on the backend APIs and lower latency perceived by API consumers. With semantic caching, you can return cached responses for identical prompts and also for prompts that are similar in meaning, even if the text isn't the same. For background, see [Tutorial: Use Azure Cache for Redis as a semantic cache](../azure-cache-for-redis/cache-tutorial-semantic-cache.md).

+ > [!NOTE]
+ > The configuration steps in this article enable semantic caching for Azure OpenAI APIs. These steps can be generalized to enable semantic caching for corresponding large language model (LLM) APIs available through the [Azure AI Model Inference API](../ai-studio/reference/reference-model-inference-api.md).
+
  ## Prerequisites

  * One or more Azure OpenAI Service APIs must be added to your API Management instance. For more information, see [Add an Azure OpenAI Service API to Azure API Management](azure-openai-api-from-specification.md).

@@ -48,13 +51,13 @@ with request body:

  When the request succeeds, the response includes a completion for the chat message.

- ## Create a backend for Embeddings API
+ ## Create a backend for embeddings API

- Configure a [backend](backends.md) resource for the Embeddings API deployment with the following settings:
+ Configure a [backend](backends.md) resource for the embeddings API deployment with the following settings:

  * **Name** - A name of your choice, such as `embeddings-backend`. You use this name to reference the backend in policies.
  * **Type** - Select **Custom URL**.
- * **Runtime URL** - The URL of the Embeddings API deployment in the Azure OpenAI Service, similar to:
+ * **Runtime URL** - The URL of the embeddings API deployment in the Azure OpenAI Service, similar to:
  ```
  https://my-aoai.openai.azure.com/openai/deployments/embeddings-deployment/embeddings
  ```

@@ -111,6 +114,9 @@ If the request is successful, the response includes a vector representation of t
  Configure the following policies to enable semantic caching for Azure OpenAI APIs in Azure API Management:
  * In the **Inbound processing** section for the API, add the [azure-openai-semantic-cache-lookup](azure-openai-semantic-cache-lookup-policy.md) policy. In the `embeddings-backend-id` attribute, specify the Embeddings API backend you created.

+ > [!NOTE]
+ > When enabling semantic caching for other large language model APIs, use the [llm-semantic-cache-lookup](llm-semantic-cache-lookup-policy.md) policy instead.
+
  Example:

  ```xml

@@ -125,6 +131,9 @@ Configure the following policies to enable semantic caching for Azure OpenAI API

  * In the **Outbound processing** section for the API, add the [azure-openai-semantic-cache-store](azure-openai-semantic-cache-store-policy.md) policy.

+ > [!NOTE]
+ > When enabling semantic caching for other large language model APIs, use the [llm-semantic-cache-store](llm-semantic-cache-store-policy.md) policy instead.
+
  Example:

  ```xml
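For orientation, a sketch of how the inbound lookup and outbound store policies described above fit together in one policy definition. The `embeddings-backend-id` value refers to the `embeddings-backend` resource created earlier; the similarity threshold, authentication mode, and cache duration are illustrative placeholders.

```xml
<policies>
    <inbound>
        <base />
        <!-- Return a cached response when a semantically similar prompt was answered before -->
        <azure-openai-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned" />
    </inbound>
    <outbound>
        <base />
        <!-- Store the response for future semantic lookups; duration is in seconds -->
        <azure-openai-semantic-cache-store duration="60" />
    </outbound>
</policies>
```

For other LLM APIs, the `llm-semantic-cache-lookup` and `llm-semantic-cache-store` policies take the place of the `azure-openai-*` elements, as the notes above call out.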

articles/api-management/llm-emit-token-metric-policy.md

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@ author: dlepow

  ms.service: azure-api-management
  ms.topic: article
- ms.date: 08/07/2024
+ ms.date: 08/08/2024
  ms.author: danlep
  ms.collection: ce-skilling-ai-copilot
  ms.custom:

@@ -16,7 +16,7 @@ ms.custom:

  [!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

- The `llm-emit-token-metric` policy sends metrics to Application Insights about consumption of large language model tokens through LLM APIs. Token count metrics include: Total Tokens, Prompt Tokens, and Completion Tokens.
+ The `llm-emit-token-metric` policy sends metrics to Application Insights about consumption of large language model (LLM) tokens through LLM APIs. Token count metrics include: Total Tokens, Prompt Tokens, and Completion Tokens.

  [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
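A minimal sketch of how this policy might be configured, assuming it takes the same shape as its Azure OpenAI counterpart (an optional namespace plus `dimension` elements); the namespace and dimensions shown are placeholders.

```xml
<inbound>
    <base />
    <!-- Send Total, Prompt, and Completion token counts to Application Insights -->
    <llm-emit-token-metric namespace="llm-metrics">
        <!-- Each dimension becomes a filterable property on the emitted metrics -->
        <dimension name="API ID" />
        <dimension name="Subscription ID" />
    </llm-emit-token-metric>
</inbound>
```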

articles/api-management/llm-enable-semantic-caching.md

Lines changed: 0 additions & 142 deletions
This file was deleted.

articles/api-management/llm-semantic-cache-lookup-policy.md

Lines changed: 4 additions & 4 deletions
@@ -13,15 +13,15 @@ ms.date: 08/07/2024
  ms.author: danlep
  ---

- # Get cached responses of Azure OpenAI API requests
+ # Get cached responses of large language model API requests

  [!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

- Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses to Azure OpenAI Chat Completion API and Completion API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend Azure OpenAI API and lowers latency perceived by API consumers.
+ Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses to large language model (LLM) API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend LLM API and lowers latency perceived by API consumers.

  > [!NOTE]
- > * This policy must have a corresponding [Cache responses to Azure OpenAI API requests](llm-semantic-cache-store-policy.md) policy.
- > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](llm-enable-semantic-caching.md).
+ > * This policy must have a corresponding [Cache responses to large language model API requests](llm-semantic-cache-store-policy.md) policy.
+ > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
  > * Currently, this policy is in preview.

  [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
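A sketch of the lookup policy in an API's inbound section, assuming it mirrors the attributes of the `azure-openai-semantic-cache-lookup` policy; the score threshold, backend name, and authentication mode are placeholders.

```xml
<inbound>
    <base />
    <!-- Serve a cached completion when the prompt is semantically close to an earlier one -->
    <llm-semantic-cache-lookup
        score-threshold="0.05"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned" />
</inbound>
```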

articles/api-management/llm-semantic-cache-store-policy.md

Lines changed: 3 additions & 3 deletions
@@ -8,19 +8,19 @@ ms.service: azure-api-management
  ms.collection: ce-skilling-ai-copilot
  ms.custom:
  ms.topic: article
- ms.date: 08/07/2024
+ ms.date: 08/08/2024
  ms.author: danlep
  ---

  # Cache responses to large language model API requests

  [!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

- The `llm-semantic-cache-store` policy caches responses to [TBD - add here] requests to a configured external cache. Response caching reduces bandwidth and processing requirements imposed on the backend Azure OpenAI API and lowers latency perceived by API consumers.
+ The `llm-semantic-cache-store` policy caches responses to chat completion API and completion API requests to a configured external cache. Response caching reduces bandwidth and processing requirements imposed on the backend Azure OpenAI API and lowers latency perceived by API consumers.

  > [!NOTE]
  > * This policy must have a corresponding [Get cached responses to large language model API requests](llm-semantic-cache-lookup-policy.md) policy.
- > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for large language model APIs in Azure API Management](llm-enable-semantic-caching.md).
+ > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
  > * Currently, this policy is in preview.

  [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
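A sketch of the companion store policy in an API's outbound section, assuming a `duration` attribute in seconds as in the `azure-openai-semantic-cache-store` policy; the value is a placeholder.

```xml
<outbound>
    <base />
    <!-- Cache the completion so the lookup policy can serve it for similar prompts -->
    <llm-semantic-cache-store duration="120" />
</outbound>
```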

articles/api-management/llm-token-limit-policy.md

Lines changed: 3 additions & 2 deletions
@@ -8,7 +8,7 @@ ms.service: azure-api-management
  ms.collection: ce-skilling-ai-copilot
  ms.custom:
  ms.topic: article
- ms.date: 08/07/2024
+ ms.date: 08/08/2024
  ms.author: danlep
  ---

@@ -43,7 +43,7 @@ By relying on token usage metrics returned from the LLM endpoint, the policy can
  | -------------- | ----------------------------------------------------------------------------------------------------- | -------- | ------- |
  | counter-key | The key to use for the token limit policy. For each key value, a single counter is used for all scopes at which the policy is configured. Policy expressions are allowed.| Yes | N/A |
  | tokens-per-minute | The maximum number of tokens consumed by prompt and completion per minute. | Yes | N/A |
- | estimate-prompt-tokens | Boolean value that determines whether to estimate the number of tokens required for a prompt: <br> - `true`: estimate the number of tokens based on prompt schema in API; may reduce performance. <br> - `false`: don't estimate prompt tokens. | Yes | N/A |
+ | estimate-prompt-tokens | Boolean value that determines whether to estimate the number of tokens required for a prompt: <br> - `true`: estimate the number of tokens based on the prompt schema in the API; may reduce performance. <br> - `false`: don't estimate prompt tokens. <br><br>When set to `false`, the remaining tokens per `counter-key` are calculated using the actual token usage from the model's response. This could result in prompts that exceed the token limit being sent to the model. In that case, the overage is detected in the response, and all subsequent requests are blocked by the policy until the token limit frees up again. | Yes | N/A |
  | retry-after-header-name | The name of a custom response header whose value is the recommended retry interval in seconds after the specified `tokens-per-minute` is exceeded. Policy expressions aren't allowed. | No | `Retry-After` |
  | retry-after-variable-name | The name of a variable that stores the recommended retry interval in seconds after the specified `tokens-per-minute` is exceeded. Policy expressions aren't allowed. | No | N/A |
  | remaining-tokens-header-name | The name of a response header whose value after each policy execution is the number of remaining tokens allowed for the time interval. Policy expressions aren't allowed.| No | N/A |
@@ -59,6 +59,7 @@ By relying on token usage metrics returned from the LLM endpoint, the policy can

  ### Usage notes

+ *
  * This policy can be used multiple times per policy definition.
  * Where available when `estimate-prompt-tokens` is set to `false`, values in the usage section of the response from the LLM API are used to determine token usage.
  * Certain LLM endpoints support streaming of responses. When `stream` is set to `true` in the API request to enable streaming, prompt tokens are always estimated, regardless of the value of the `estimate-prompt-tokens` attribute.
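Drawing on the attributes in the table above, a sketch of how the policy might be applied per subscription; the numeric limit and header names are illustrative placeholders.

```xml
<inbound>
    <base />
    <!-- Allow up to an illustrative 5,000 prompt + completion tokens per minute per subscription -->
    <llm-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="5000"
        estimate-prompt-tokens="false"
        remaining-tokens-header-name="remaining-tokens"
        retry-after-header-name="retry-after" />
</inbound>
```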

includes/api-management-llm-models.md

Lines changed: 1 addition & 1 deletion
@@ -10,4 +10,4 @@ ms.author: danlep

  ## Supported models

- The policy is used with LLMs added to Azure API Management that are exposed through the [Azure AI Model Inference API](../articles/ai-studio/reference/reference-model-inference-api.md).
+ Use the policy with LLM APIs added to Azure API Management that are available through the [Azure AI Model Inference API](../articles/ai-studio/reference/reference-model-inference-api.md).
