
Commit e29b22a

[APIM] LLM API policies
1 parent 3f663b1 commit e29b22a

8 files changed: +509 -0 lines changed

articles/api-management/api-management-policies.md

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,7 @@ More information about policies:
| [Set usage quota by key](quota-by-key-policy.md) | Allows you to enforce a renewable or lifetime call volume and/or bandwidth quota, on a per key basis. | Yes | No | No | Yes |
| [Limit concurrency](limit-concurrency-policy.md) | Prevents enclosed policies from executing by more than the specified number of requests at a time. | Yes | Yes | Yes | Yes |
| [Limit Azure OpenAI Service token usage](azure-openai-token-limit-policy.md) | Prevents Azure OpenAI API usage spikes by limiting language model tokens per calculated key. | Yes | Yes | No | No |
| [Limit large language model API token usage](llm-token-limit-policy.md) | Prevents large language model API usage spikes by limiting language model tokens per calculated key. | Yes | Yes | No | No |

## Authentication and authorization

@@ -82,6 +83,8 @@ More information about policies:
| [Remove value from cache](cache-remove-value-policy.md) | Removes an item in the cache by key. | Yes | Yes | Yes | Yes |
| [Get cached responses of Azure OpenAI API requests](azure-openai-semantic-cache-lookup-policy.md) | Performs cache lookup using semantic search and returns a valid cached response when available. | Yes | Yes | Yes | Yes |
| [Store responses of Azure OpenAI API requests to cache](azure-openai-semantic-cache-store-policy.md) | Caches response according to the Azure OpenAI API cache configuration. | Yes | Yes | Yes | Yes |
| [Get cached responses of large language model API requests](llm-semantic-cache-lookup-policy.md) | Performs cache lookup using semantic search and returns a valid cached response when available. | Yes | Yes | Yes | Yes |
| [Store responses of large language model API requests to cache](llm-semantic-cache-store-policy.md) | Caches response according to the large language model API cache configuration. | Yes | Yes | Yes | Yes |

@@ -131,6 +134,7 @@ More information about policies:
| [Trace](trace-policy.md) | Adds custom traces into the [request tracing](./api-management-howto-api-inspector.md) output in the test console, Application Insights telemetries, and resource logs. | Yes | Yes<sup>1</sup> | Yes | Yes |
| [Emit metrics](emit-metric-policy.md) | Sends custom metrics to Application Insights at execution. | Yes | Yes | Yes | Yes |
| [Emit Azure OpenAI token metrics](azure-openai-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of language model tokens through Azure OpenAI service APIs. | Yes | Yes | No | No |
| [Emit large language model API token metrics](llm-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of language model tokens through large language model APIs. | Yes | Yes | No | No |

<sup>1</sup> In the V2 gateway, the `trace` policy currently does not add tracing output in the test console.

articles/api-management/azure-openai-token-limit-policy.md

Lines changed: 1 addition & 0 deletions
@@ -87,6 +87,7 @@ In the following example, the token limit of 5000 per minute is keyed by the cal
## Related policies

* [Rate limiting and quotas](api-management-policies.md#rate-limiting-and-quotas)
* [llm-token-limit](llm-token-limit-policy.md) policy
* [azure-openai-emit-token-metric](azure-openai-emit-token-metric-policy.md) policy

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]

articles/api-management/llm-emit-token-metric-policy.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
---
title: Azure API Management policy reference - llm-emit-token-metric
description: Reference for the llm-emit-token-metric policy available for use in Azure API Management. Provides policy usage, settings, and examples.
services: api-management
author: dlepow

ms.service: azure-api-management
ms.topic: article
ms.date: 08/07/2024
ms.author: danlep
ms.collection: ce-skilling-ai-copilot
ms.custom:
---

# Emit metrics for consumption of large language model tokens

[!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

The `llm-emit-token-metric` policy sends metrics to Application Insights about consumption of large language model tokens through LLM APIs. Token count metrics include: Total Tokens, Prompt Tokens, and Completion Tokens.

[!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

[!INCLUDE [api-management-llm-models](../../includes/api-management-llm-models.md)]

## Prerequisites

* One or more LLM APIs must be added to your API Management instance.
* Your API Management instance must be integrated with Application Insights. For more information, see [How to integrate Azure API Management with Azure Application Insights](./api-management-howto-app-insights.md#create-a-connection-using-the-azure-portal).
* Enable Application Insights logging for your LLM APIs.
* Enable custom metrics with dimensions in Application Insights. For more information, see [Emit custom metrics](api-management-howto-app-insights.md#emit-custom-metrics).

## Policy statement

```xml
<llm-emit-token-metric
        namespace="metric namespace" >
    <dimension name="dimension name" value="dimension value" />
    ...additional dimensions...
</llm-emit-token-metric>
```

## Attributes

| Attribute | Description | Required | Default value |
| --------- | -------------------------- | ------------------ | -------------- |
| namespace | A string. Namespace of metric. Policy expressions aren't allowed. | No | API Management |
| value | Value of metric expressed as a double. Policy expressions are allowed. | No | 1 |

## Elements

| Element | Description | Required |
| ----------- | --------------------------------------------------------------------------------- | -------- |
| dimension | Add one or more of these elements for each dimension included in the metric. | Yes |

### dimension attributes

| Attribute | Description | Required | Default value |
| --------- | -------------------------- | ------------------ | -------------- |
| name | A string or policy expression. Name of dimension. | Yes | N/A |
| value | A string or policy expression. Value of dimension. Can only be omitted if `name` matches one of the default dimensions. If so, the value is provided based on the dimension name. | No | N/A |

### Default dimension names that may be used without value

* API ID
* Operation ID
* Product ID
* User ID
* Subscription ID
* Location
* Gateway ID
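
For instance, a minimal configuration can omit the `namespace` attribute and dimension values entirely and rely on the defaults described above. The following is an illustrative sketch, not a prescribed configuration:

```xml
<llm-emit-token-metric>
    <!-- Values for default dimensions are filled in automatically; the metric namespace defaults to "API Management" -->
    <dimension name="API ID" />
    <dimension name="Product ID" />
    <dimension name="Subscription ID" />
</llm-emit-token-metric>
```
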
## Usage

- [**Policy sections:**](./api-management-howto-policies.md#sections) inbound
- [**Policy scopes:**](./api-management-howto-policies.md#scopes) global, workspace, product, API, operation
- [**Gateways:**](api-management-gateways-overview.md) classic, v2, consumption, self-hosted, workspace

### Usage notes

* This policy can be used multiple times per policy definition.
* You can configure at most 10 custom dimensions for this policy.
* Where available, values in the usage section of the response from the LLM API are used to determine token metrics.
* Certain LLM endpoints support streaming of responses. When `stream` is set to `true` in the API request to enable streaming, token metrics are estimated.

## Example

The following example sends LLM token count metrics to Application Insights along with User ID, Client IP, and API ID as dimensions.

```xml
<policies>
  <inbound>
      <llm-emit-token-metric
            namespace="MyLLM">
          <dimension name="User ID" />
          <dimension name="Client IP" value="@(context.Request.IpAddress)" />
          <dimension name="API ID" />
      </llm-emit-token-metric>
  </inbound>
  <outbound>
  </outbound>
</policies>
```

## Related policies

* [Logging](api-management-policies.md#logging)
* [emit-metric](emit-metric-policy.md) policy
* [azure-openai-emit-token-metric](azure-openai-emit-token-metric-policy.md) policy
* [llm-token-limit](llm-token-limit-policy.md) policy

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]

articles/api-management/llm-enable-semantic-caching.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
---
title: Enable semantic caching for LLM APIs in Azure API Management
description: Prerequisites and configuration steps to enable semantic caching for large language model APIs in Azure API Management.
author: dlepow
ms.service: azure-api-management
ms.custom:
ms.topic: how-to
ms.date: 08/07/2024
ms.author: danlep
ms.collection: ce-skilling-ai-copilot
---

# Enable semantic caching for large language model (LLM) APIs in Azure API Management

[!INCLUDE [api-management-availability-basicv2-standardv2](../../includes/api-management-availability-basicv2-standardv2.md)]

Enable semantic caching of responses to large language model (LLM) API requests to reduce bandwidth and processing requirements imposed on the backend APIs and lower latency perceived by API consumers. With semantic caching, you can return cached responses for identical prompts and also for prompts that are similar in meaning, even if the text isn't the same. For background, see [Tutorial: Use Azure Cache for Redis as a semantic cache](../azure-cache-for-redis/cache-tutorial-semantic-cache.md).

## Prerequisites

* One or more LLM APIs must be added to your API Management instance. For more information, see [TBD...].
* Prerequisites for the Azure AI Model Inference API [TBD...].
* The API Management instance must be configured to use managed identity authentication to the LLM APIs. For more information, see [Authenticate and authorize access to Azure OpenAI APIs using Azure API Management](api-management-authenticate-authorize-azure-openai.md#authenticate-with-managed-identity).
* [Azure Cache for Redis Enterprise](../azure-cache-for-redis/quickstart-create-redis-enterprise.md). The **RediSearch** module must be enabled on the Redis Enterprise cache.
    > [!NOTE]
    > You can only enable the **RediSearch** module when creating a new Redis Enterprise cache. You can't add a module to an existing cache. [Learn more](../azure-cache-for-redis/cache-redis-modules.md)
* External cache configured in the Azure API Management instance. For steps, see [Use an external Azure Cache for Redis in Azure API Management](api-management-howto-cache-external.md).

<!-- The following steps are for AOAI. Revise for Azure AI Model Inference API -->

## Test Chat API deployment

First, test the Azure OpenAI deployment to ensure that the Chat Completion API or Chat API is working as expected. For steps, see [Import an Azure OpenAI API to Azure API Management](llm-api-from-specification.md#test-the-llm-api).

For example, test the Azure OpenAI Chat API by sending a POST request to the API endpoint with a prompt in the request body. The response should include the completion of the prompt. Example request:

```rest
POST https://my-api-management.azure-api.net/my-api/openai/deployments/chat-deployment/chat/completions?api-version=2024-02-01
```

with request body:

```json
{"messages":[{"role":"user","content":"Hello"}]}
```

When the request succeeds, the response includes a completion for the chat message.

## Create a backend for Embeddings API

Configure a [backend](backends.md) resource for the Embeddings API deployment with the following settings:

* **Name** - A name of your choice, such as `embeddings-backend`. You use this name to reference the backend in policies.
* **Type** - Select **Custom URL**.
* **Runtime URL** - The URL of the Embeddings API deployment in the Azure OpenAI Service, similar to:

    ```
    https://my-aoai.openai.azure.com/openai/deployments/embeddings-deployment/embeddings
    ```

### Test backend

To test the backend, create an API operation for your Azure OpenAI Service API:

1. On the **Design** tab of your API, select **+ Add operation**.
1. Enter a **Display name** and optionally a **Name** for the operation.
1. In the **Frontend** section, in **URL**, select **POST** and enter the path `/`.
1. On the **Headers** tab, add a required header with the name `Content-Type` and value `application/json`.
1. Select **Save**.

Configure the following policies in the **Inbound processing** section of the API operation. In the [set-backend-service](set-backend-service-policy.md) policy, substitute the name of the backend you created.

```xml
<policies>
    <inbound>
        <set-backend-service backend-id="embeddings-backend" />
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        [...]
    </inbound>
    [...]
</policies>
```

On the **Test** tab, test the operation by adding an `api-version` query parameter with a value such as `2024-02-01`. Provide a valid request body. For example:

```json
{"input":"Hello"}
```

If the request is successful, the response includes a vector representation of the input text:

```json
{
    "object": "list",
    "data": [{
        "object": "embedding",
        "index": 0,
        "embedding": [
            -0.021829502,
            -0.007157768,
            -0.028619017,
            [...]
        ]
    }]
}
```

## Configure semantic caching policies

Configure the following policies to enable semantic caching for LLM APIs in Azure API Management (a combined sketch of both policies appears after these steps):

* In the **Inbound processing** section for the API, add the [llm-semantic-cache-lookup](llm-semantic-cache-lookup-policy.md) policy. In the `embeddings-backend-id` attribute, specify the Embeddings API backend you created.

    Example:

    ```xml
    <llm-semantic-cache-lookup
        score-threshold="0.8"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned"
        ignore-system-messages="true"
        max-message-count="10">
        <vary-by>@(context.Subscription.Id)</vary-by>
    </llm-semantic-cache-lookup>
    ```

* In the **Outbound processing** section for the API, add the [llm-semantic-cache-store](llm-semantic-cache-store-policy.md) policy.

    Example:

    ```xml
    <llm-semantic-cache-store duration="60" />
    ```
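
Taken together, the two policies might look like the following in a complete policy definition. This is a minimal sketch that assumes the `embeddings-backend` backend created earlier and reuses the sample values from the preceding steps; adjust the score threshold, cache duration, and `vary-by` expression for your scenario.

```xml
<policies>
    <inbound>
        <base />
        <!-- Return a cached response when a semantically similar prompt scores above the threshold -->
        <llm-semantic-cache-lookup
            score-threshold="0.8"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned"
            ignore-system-messages="true"
            max-message-count="10">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- Store the backend response in the semantic cache for 60 seconds -->
        <llm-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
```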

## Confirm caching

To confirm that semantic caching is working as expected, trace a test Completion or Chat Completion operation using the test console in the portal. Confirm that the cache was used on subsequent tries by inspecting the trace. [Learn more about tracing API calls in Azure API Management](api-management-howto-api-inspector.md).

For example, if the cache was used, the **Output** section includes entries similar to ones in the following screenshot:

:::image type="content" source="media/llm-enable-semantic-caching/cache-lookup.png" alt-text="Screenshot of request trace in the Azure portal.":::

## Related content

* [Caching policies](api-management-policies.md#caching)
* [Azure Cache for Redis](../azure-cache-for-redis/cache-overview.md)

articles/api-management/llm-semantic-cache-lookup-policy.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
---
title: Azure API Management policy reference - llm-semantic-cache-lookup | Microsoft Docs
description: Reference for the llm-semantic-cache-lookup policy available for use in Azure API Management. Provides policy usage, settings, and examples.
services: api-management
author: dlepow

ms.service: azure-api-management
ms.collection: ce-skilling-ai-copilot
ms.custom:
  - build-2024
ms.topic: article
ms.date: 08/07/2024
ms.author: danlep
---

# Get cached responses of large language model API requests

[!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses to large language model (LLM) Chat Completion API and Completion API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend LLM API and lowers latency perceived by API consumers.

> [!NOTE]
> * This policy must have a corresponding [Store responses of large language model API requests to cache](llm-semantic-cache-store-policy.md) policy.
> * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for LLM APIs in Azure API Management](llm-enable-semantic-caching.md).
> * Currently, this policy is in preview.

[!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

## Policy statement

```xml
<llm-semantic-cache-lookup
    score-threshold="similarity score threshold"
    embeddings-backend-id="backend entity ID for embeddings API"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true | false"
    max-message-count="count" >
    <vary-by>"expression to partition caching"</vary-by>
</llm-semantic-cache-lookup>
```

## Attributes

| Attribute | Description | Required | Default |
| ----------------- | ------------------------------------------------------ | -------- | ------- |
| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. [Learn more](../azure-cache-for-redis/cache-tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
| embeddings-backend-id | [Backend](backends.md) ID for OpenAI embeddings API call. | Yes | N/A |
| embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to `system-assigned`. | N/A |
| ignore-system-messages | Boolean. If set to `true`, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
| max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |

## Elements

|Name|Description|Required|
|----------|-----------------|--------------|
|vary-by| A custom expression determined at runtime whose value partitions caching. If multiple `vary-by` elements are added, values are concatenated to create a unique combination. | No |
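
For example, the following sketch (illustrative values only, assuming a backend named `embeddings-backend`) adds two `vary-by` elements so that cached responses are partitioned by the combination of subscription ID and caller IP address:

```xml
<llm-semantic-cache-lookup
    score-threshold="0.8"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned">
    <!-- Values of the vary-by expressions are concatenated to form the cache partition -->
    <vary-by>@(context.Subscription.Id)</vary-by>
    <vary-by>@(context.Request.IpAddress)</vary-by>
</llm-semantic-cache-lookup>
```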

## Usage

- [**Policy sections:**](./api-management-howto-policies.md#sections) inbound
- [**Policy scopes:**](./api-management-howto-policies.md#scopes) global, product, API, operation
- [**Gateways:**](api-management-gateways-overview.md) v2

### Usage notes

- This policy can only be used once in a policy section.

## Examples

### Example with corresponding llm-semantic-cache-store policy

[!INCLUDE [api-management-semantic-cache-example](../../includes/api-management-semantic-cache-example.md)]

## Related policies

* [Caching](api-management-policies.md#caching)
* [llm-semantic-cache-store](llm-semantic-cache-store-policy.md)

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]
