
Commit fe4f1e4

Merge pull request #283934 from dlepow/apimllm

[APIM] LLM API policies

2 parents: 933ff93 + 747ff1e

9 files changed: +397 -7 lines

articles/api-management/TOC.yml

Lines changed: 9 additions & 1 deletion

@@ -217,7 +217,7 @@
     href: sap-api.md
   - name: Import gRPC API
     href: grpc-api.md
-  - name: Azure OpenAI
+  - name: Azure OpenAI and LLM APIs
     items:
     - name: Import Azure OpenAI API
       href: azure-openai-api-from-specification.md
@@ -543,6 +543,14 @@
     href: json-to-xml-policy.md
   - name: limit-concurrency
     href: limit-concurrency-policy.md
+  - name: llm-emit-token-metric
+    href: llm-emit-token-metric-policy.md
+  - name: llm-semantic-cache-lookup
+    href: llm-semantic-cache-lookup-policy.md
+  - name: llm-semantic-cache-store
+    href: llm-semantic-cache-store-policy.md
+  - name: llm-token-limit
+    href: llm-token-limit-policy.md
   - name: log-to-eventhub
     href: log-to-eventhub-policy.md
   - name: mock-response

articles/api-management/api-management-policies.md

Lines changed: 7 additions & 3 deletions

@@ -36,7 +36,8 @@ More information about policies:
 | [Set usage quota by subscription](quota-policy.md) | Allows you to enforce a renewable or lifetime call volume and/or bandwidth quota, on a per subscription basis. | Yes | Yes | Yes | Yes
 | [Set usage quota by key](quota-by-key-policy.md) | Allows you to enforce a renewable or lifetime call volume and/or bandwidth quota, on a per key basis. | Yes | No | No | Yes |
 | [Limit concurrency](limit-concurrency-policy.md) | Prevents enclosed policies from executing by more than the specified number of requests at a time. | Yes | Yes | Yes | Yes |
-| [Limit Azure OpenAI Service token usage](azure-openai-token-limit-policy.md) | Prevents Azure OpenAI API usage spikes by limiting language model tokens per calculated key. | Yes | Yes | No | No |
+| [Limit Azure OpenAI Service token usage](azure-openai-token-limit-policy.md) | Prevents Azure OpenAI API usage spikes by limiting large language model tokens per calculated key. | Yes | Yes | No | No |
+| [Limit large language model API token usage](llm-token-limit-policy.md) | Prevents large language model (LLM) API usage spikes by limiting LLM tokens per calculated key. | Yes | Yes | No | No |
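The `llm-token-limit` reference page isn't among the files shown in this diff. As rough orientation only, a hypothetical configuration might mirror the existing `azure-openai-token-limit` schema; the attribute names and values below are assumptions, not confirmed by this change:

```xml
<policies>
    <inbound>
        <!-- Hypothetical sketch only: throttle LLM token consumption per caller IP.
             Attribute names mirror azure-openai-token-limit and may differ in the
             published llm-token-limit reference. -->
        <llm-token-limit
            counter-key="@(context.Request.IpAddress)"
            tokens-per-minute="5000"
            estimate-prompt-tokens="false"
            remaining-tokens-variable-name="remainingTokens" />
        <base />
    </inbound>
</policies>
```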

 ## Authentication and authorization

@@ -80,8 +81,10 @@ More information about policies:
 | [Get value from cache](cache-lookup-value-policy.md) | Retrieves a cached item by key. | Yes | Yes | Yes | Yes |
 | [Store value in cache](cache-store-value-policy.md) | Stores an item in the cache by key. | Yes | Yes | Yes | Yes |
 | [Remove value from cache](cache-remove-value-policy.md) | Removes an item in the cache by key. | Yes | Yes | Yes | Yes |
-| [Get cached responses of Azure OpenAI API requests](azure-openai-semantic-cache-lookup-policy.md) | Performs cache lookup using semantic search and returns a valid cached response when available. | Yes | Yes | Yes | Yes |
+| [Get cached responses of Azure OpenAI API requests](azure-openai-semantic-cache-lookup-policy.md) | Performs lookup in Azure OpenAI API cache using semantic search and returns a valid cached response when available. | Yes | Yes | Yes | Yes |
 | [Store responses of Azure OpenAI API requests to cache](azure-openai-semantic-cache-store-policy.md) | Caches response according to the Azure OpenAI API cache configuration. | Yes | Yes | Yes | Yes |
+| [Get cached responses of large language model API requests](llm-semantic-cache-lookup-policy.md) | Performs lookup in large language model API cache using semantic search and returns a valid cached response when available. | Yes | Yes | Yes | Yes |
+| [Store responses of large language model API requests to cache](llm-semantic-cache-store-policy.md) | Caches response according to the large language model API cache configuration. | Yes | Yes | Yes | Yes |

@@ -130,7 +133,8 @@ More information about policies:
 |---------|---------|---------|---------|---------|--------|
 | [Trace](trace-policy.md) | Adds custom traces into the [request tracing](./api-management-howto-api-inspector.md) output in the test console, Application Insights telemetries, and resource logs. | Yes | Yes<sup>1</sup> | Yes | Yes |
 | [Emit metrics](emit-metric-policy.md) | Sends custom metrics to Application Insights at execution. | Yes | Yes | Yes | Yes |
-| [Emit Azure OpenAI token metrics](azure-openai-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of language model tokens through Azure OpenAI service APIs. | Yes | Yes | No | No |
+| [Emit Azure OpenAI token metrics](azure-openai-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of large language model tokens through Azure OpenAI service APIs. | Yes | Yes | No | No |
+| [Emit large language model API token metrics](llm-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of large language model (LLM) tokens through LLM APIs. | Yes | Yes | No | No |

 <sup>1</sup> In the V2 gateway, the `trace` policy currently does not add tracing output in the test console.

articles/api-management/azure-openai-enable-semantic-caching.md

Lines changed: 12 additions & 3 deletions

@@ -17,6 +17,9 @@ ms.collection: ce-skilling-ai-copilot

 Enable semantic caching of responses to Azure OpenAI API requests to reduce bandwidth and processing requirements imposed on the backend APIs and lower latency perceived by API consumers. With semantic caching, you can return cached responses for identical prompts and also for prompts that are similar in meaning, even if the text isn't the same. For background, see [Tutorial: Use Azure Cache for Redis as a semantic cache](../azure-cache-for-redis/cache-tutorial-semantic-cache.md).

+> [!NOTE]
+> The configuration steps in this article enable semantic caching for Azure OpenAI APIs. These steps can be generalized to enable semantic caching for corresponding large language model (LLM) APIs available through the [Azure AI Model Inference API](../ai-studio/reference/reference-model-inference-api.md).
+
 ## Prerequisites

 * One or more Azure OpenAI Service APIs must be added to your API Management instance. For more information, see [Add an Azure OpenAI Service API to Azure API Management](azure-openai-api-from-specification.md).
@@ -48,13 +51,13 @@ with request body:

 When the request succeeds, the response includes a completion for the chat message.

-## Create a backend for Embeddings API
+## Create a backend for embeddings API

-Configure a [backend](backends.md) resource for the Embeddings API deployment with the following settings:
+Configure a [backend](backends.md) resource for the embeddings API deployment with the following settings:

 * **Name** - A name of your choice, such as `embeddings-backend`. You use this name to reference the backend in policies.
 * **Type** - Select **Custom URL**.
-* **Runtime URL** - The URL of the Embeddings API deployment in the Azure OpenAI Service, similar to:
+* **Runtime URL** - The URL of the embeddings API deployment in the Azure OpenAI Service, similar to:
 ```
 https://my-aoai.openai.azure.com/openai/deployments/embeddings-deployment/embeddings
 ```
@@ -111,6 +114,9 @@ If the request is successful, the response includes a vector representation of t
 Configure the following policies to enable semantic caching for Azure OpenAI APIs in Azure API Management:
 * In the **Inbound processing** section for the API, add the [azure-openai-semantic-cache-lookup](azure-openai-semantic-cache-lookup-policy.md) policy. In the `embeddings-backend-id` attribute, specify the Embeddings API backend you created.

+> [!NOTE]
+> When enabling semantic caching for other large language model APIs, use the [llm-semantic-cache-lookup](llm-semantic-cache-lookup-policy.md) policy instead.
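As an illustration of that substitution, a hypothetical LLM equivalent of the lookup snippet might look like the following; the values are placeholders, and the attributes follow the `llm-semantic-cache-lookup` reference added in this PR:

```xml
<llm-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned" />
```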
+
 Example:

 ```xml
@@ -125,6 +131,9 @@ Configure the following policies to enable semantic caching for Azure OpenAI API

 * In the **Outbound processing** section for the API, add the [azure-openai-semantic-cache-store](azure-openai-semantic-cache-store-policy.md) policy.

+> [!NOTE]
+> When enabling semantic caching for other large language model APIs, use the [llm-semantic-cache-store](llm-semantic-cache-store-policy.md) policy instead.
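Likewise, a hypothetical LLM equivalent of the store snippet, with an illustrative duration value:

```xml
<llm-semantic-cache-store duration="60" />
```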
+
 Example:

 ```xml

articles/api-management/azure-openai-token-limit-policy.md

Lines changed: 1 addition & 0 deletions

@@ -87,6 +87,7 @@ In the following example, the token limit of 5000 per minute is keyed by the cal
 ## Related policies

 * [Rate limiting and quotas](api-management-policies.md#rate-limiting-and-quotas)
+* [llm-token-limit](llm-token-limit-policy.md) policy
 * [azure-openai-emit-token-metric](azure-openai-emit-token-metric-policy.md) policy

 [!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]
articles/api-management/llm-emit-token-metric-policy.md

Lines changed: 116 additions & 0 deletions

@@ -0,0 +1,116 @@
---
title: Azure API Management policy reference - llm-emit-token-metric
description: Reference for the llm-emit-token-metric policy available for use in Azure API Management. Provides policy usage, settings, and examples.
services: api-management
author: dlepow

ms.service: azure-api-management
ms.topic: article
ms.date: 08/08/2024
ms.author: danlep
ms.collection: ce-skilling-ai-copilot
ms.custom:
---

# Emit metrics for consumption of large language model tokens

[!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

The `llm-emit-token-metric` policy sends metrics to Application Insights about consumption of large language model (LLM) tokens through LLM APIs. Token count metrics include: Total Tokens, Prompt Tokens, and Completion Tokens.

> [!NOTE]
> Currently, this policy is in preview.

[!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

[!INCLUDE [api-management-llm-models](../../includes/api-management-llm-models.md)]

## Prerequisites

* One or more LLM APIs must be added to your API Management instance.
* Your API Management instance must be integrated with Application Insights. For more information, see [How to integrate Azure API Management with Azure Application Insights](./api-management-howto-app-insights.md#create-a-connection-using-the-azure-portal).
* Enable Application Insights logging for your LLM APIs.
* Enable custom metrics with dimensions in Application Insights. For more information, see [Emit custom metrics](api-management-howto-app-insights.md#emit-custom-metrics).

## Policy statement

```xml
<llm-emit-token-metric namespace="metric namespace">
    <dimension name="dimension name" value="dimension value" />
    ...additional dimensions...
</llm-emit-token-metric>
```

## Attributes

| Attribute | Description | Required | Default value |
| --------- | -------------------------- | ------------------ | -------------- |
| namespace | A string. Namespace of metric. Policy expressions aren't allowed. | No | API Management |
| value | Value of metric expressed as a double. Policy expressions are allowed. | No | 1 |

## Elements

| Element | Description | Required |
| ----------- | --------------------------------------------------------------------------------- | -------- |
| dimension | Add one or more of these elements for each dimension included in the metric. | Yes |

### dimension attributes

| Attribute | Description | Required | Default value |
| --------- | -------------------------- | ------------------ | -------------- |
| name | A string or policy expression. Name of dimension. | Yes | N/A |
| value | A string or policy expression. Value of dimension. Can only be omitted if `name` matches one of the default dimensions. If so, value is provided as per dimension name. | No | N/A |

### Default dimension names that may be used without value

* API ID
* Operation ID
* Product ID
* User ID
* Subscription ID
* Location
* Gateway ID

## Usage

- [**Policy sections:**](./api-management-howto-policies.md#sections) inbound
- [**Policy scopes:**](./api-management-howto-policies.md#scopes) global, workspace, product, API, operation
- [**Gateways:**](api-management-gateways-overview.md) classic, v2, consumption, self-hosted, workspace

### Usage notes

* This policy can be used multiple times per policy definition.
* You can configure at most 10 custom dimensions for this policy.
* Where available, values in the usage section of the response from the LLM API are used to determine token metrics.
* Certain LLM endpoints support streaming of responses. When `stream` is set to `true` in the API request to enable streaming, token metrics are estimated.

## Example

The following example sends LLM token count metrics to Application Insights along with User ID, Client IP, and API ID as dimensions.

```xml
<policies>
    <inbound>
        <llm-emit-token-metric namespace="MyLLM">
            <dimension name="User ID" />
            <dimension name="Client IP" value="@(context.Request.IpAddress)" />
            <dimension name="API ID" />
        </llm-emit-token-metric>
    </inbound>
    <outbound>
    </outbound>
</policies>
```

## Related policies

* [Logging](api-management-policies.md#logging)
* [emit-metric](emit-metric-policy.md) policy
* [azure-openai-emit-token-metric](azure-openai-emit-token-metric-policy.md) policy
* [llm-token-limit](llm-token-limit-policy.md) policy

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]
articles/api-management/llm-semantic-cache-lookup-policy.md

Lines changed: 81 additions & 0 deletions

@@ -0,0 +1,81 @@
---
title: Azure API Management policy reference - llm-semantic-cache-lookup | Microsoft Docs
description: Reference for the llm-semantic-cache-lookup policy available for use in Azure API Management. Provides policy usage, settings, and examples.
services: api-management
author: dlepow

ms.service: azure-api-management
ms.collection: ce-skilling-ai-copilot
ms.custom:
  - build-2024
ms.topic: article
ms.date: 08/07/2024
ms.author: danlep
---

# Get cached responses of large language model API requests

[!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses to large language model (LLM) API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend LLM API and lowers latency perceived by API consumers.

> [!NOTE]
> * This policy must have a corresponding [Cache responses to large language model API requests](llm-semantic-cache-store-policy.md) policy.
> * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
> * Currently, this policy is in preview.

[!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

## Policy statement

```xml
<llm-semantic-cache-lookup
    score-threshold="similarity score threshold"
    embeddings-backend-id="backend entity ID for embeddings API"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true | false"
    max-message-count="count" >
    <vary-by>"expression to partition caching"</vary-by>
</llm-semantic-cache-lookup>
```

## Attributes

| Attribute | Description | Required | Default |
| ----------------- | ------------------------------------------------------ | -------- | ------- |
| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. [Learn more](../azure-cache-for-redis/cache-tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
| embeddings-backend-id | [Backend](backends.md) ID for OpenAI embeddings API call. | Yes | N/A |
| embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to `system-assigned`. | N/A |
| ignore-system-messages | Boolean. If set to `true`, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
| max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |

## Elements

|Name|Description|Required|
|----------|-----------------|--------------|
|vary-by| A custom expression determined at runtime whose value partitions caching. If multiple `vary-by` elements are added, values are concatenated to create a unique combination. | No |

## Usage

- [**Policy sections:**](./api-management-howto-policies.md#sections) inbound
- [**Policy scopes:**](./api-management-howto-policies.md#scopes) global, product, API, operation
- [**Gateways:**](api-management-gateways-overview.md) v2

### Usage notes

- This policy can only be used once in a policy section.

## Examples

### Example with corresponding llm-semantic-cache-store policy

[!INCLUDE [api-management-semantic-cache-example](../../includes/api-management-semantic-cache-example.md)]
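The committed example comes from the shared include above, which isn't expanded in this diff. For orientation, a minimal sketch of how the lookup and store policies might be paired, assuming an embeddings backend named `embeddings-backend`; all values here are illustrative rather than taken from the include:

```xml
<policies>
    <inbound>
        <base />
        <!-- Illustrative: serve a cached response when a semantically similar prompt
             was answered recently. Threshold and backend name are placeholders. -->
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned"
            ignore-system-messages="true"
            max-message-count="10">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- Illustrative: keep new responses for 60 seconds so later, similar prompts can hit the cache. -->
        <llm-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
```

The `vary-by` expression partitions cache entries by subscription so that different callers don't receive each other's cached completions.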

## Related policies

* [Caching](api-management-policies.md#caching)
* [llm-semantic-cache-store](llm-semantic-cache-store-policy.md)

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]
articles/api-management/llm-semantic-cache-store-policy.md

Lines changed: 64 additions & 0 deletions

@@ -0,0 +1,64 @@
---
title: Azure API Management policy reference - llm-semantic-cache-store
description: Reference for the llm-semantic-cache-store policy available for use in Azure API Management. Provides policy usage, settings, and examples.
services: api-management
author: dlepow

ms.service: azure-api-management
ms.collection: ce-skilling-ai-copilot
ms.custom:
ms.topic: article
ms.date: 08/08/2024
ms.author: danlep
---

# Cache responses to large language model API requests

[!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

The `llm-semantic-cache-store` policy caches responses to chat completion API and completion API requests to a configured external cache. Response caching reduces bandwidth and processing requirements imposed on the backend LLM API and lowers latency perceived by API consumers.

> [!NOTE]
> * This policy must have a corresponding [Get cached responses of large language model API requests](llm-semantic-cache-lookup-policy.md) policy.
> * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
> * Currently, this policy is in preview.

[!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

## Policy statement

```xml
<llm-semantic-cache-store duration="seconds"/>
```

## Attributes

| Attribute | Description | Required | Default |
| ----------------- | ------------------------------------------------------ | -------- | ------- |
| duration | Time-to-live of the cached entries, specified in seconds. Policy expressions are allowed. | Yes | N/A |

## Usage

- [**Policy sections:**](./api-management-howto-policies.md#sections) outbound
- [**Policy scopes:**](./api-management-howto-policies.md#scopes) global, product, API, operation
- [**Gateways:**](api-management-gateways-overview.md) v2

### Usage notes

- This policy can only be used once in a policy section.
- If the cache lookup fails, the API call that uses the cache-related operation doesn't raise an error, and the cache operation completes successfully.

## Examples

### Example with corresponding llm-semantic-cache-lookup policy

[!INCLUDE [api-management-semantic-cache-example](../../includes/api-management-semantic-cache-example.md)]
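Because policy expressions are allowed for `duration`, a hypothetical variant could compute the TTL at runtime; the variable name below is illustrative and would be set earlier in the pipeline (for example, with `set-variable` in the inbound section):

```xml
<policies>
    <outbound>
        <!-- Illustrative: read the cache TTL from a context variable, defaulting to 60 seconds. -->
        <llm-semantic-cache-store duration="@(context.Variables.GetValueOrDefault<int>("semanticCacheTtl", 60))" />
        <base />
    </outbound>
</policies>
```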

## Related policies

* [Caching](api-management-policies.md#caching)
* [llm-semantic-cache-lookup](llm-semantic-cache-lookup-policy.md)

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]
