
Commit fe4f1e4

Merge pull request #283934 from dlepow/apimllm

[APIM] LLM API policies

2 parents: 933ff93 + 747ff1e

9 files changed: +397 -7 lines

articles/api-management/TOC.yml

Lines changed: 9 additions & 1 deletion

@@ -217,7 +217,7 @@
     href: sap-api.md
   - name: Import gRPC API
     href: grpc-api.md
-  - name: Azure OpenAI
+  - name: Azure OpenAI and LLM APIs
     items:
     - name: Import Azure OpenAI API
       href: azure-openai-api-from-specification.md
@@ -543,6 +543,14 @@
     href: json-to-xml-policy.md
   - name: limit-concurrency
     href: limit-concurrency-policy.md
+  - name: llm-emit-token-metric
+    href: llm-emit-token-metric-policy.md
+  - name: llm-semantic-cache-lookup
+    href: llm-semantic-cache-lookup-policy.md
+  - name: llm-semantic-cache-store
+    href: llm-semantic-cache-store-policy.md
+  - name: llm-token-limit
+    href: llm-token-limit-policy.md
   - name: log-to-eventhub
     href: log-to-eventhub-policy.md
   - name: mock-response

articles/api-management/api-management-policies.md

Lines changed: 7 additions & 3 deletions

@@ -36,7 +36,8 @@ More information about policies:
 | [Set usage quota by subscription](quota-policy.md) | Allows you to enforce a renewable or lifetime call volume and/or bandwidth quota, on a per subscription basis. | Yes | Yes | Yes | Yes
 | [Set usage quota by key](quota-by-key-policy.md) | Allows you to enforce a renewable or lifetime call volume and/or bandwidth quota, on a per key basis. | Yes | No | No | Yes |
 | [Limit concurrency](limit-concurrency-policy.md) | Prevents enclosed policies from executing by more than the specified number of requests at a time. | Yes | Yes | Yes | Yes |
-| [Limit Azure OpenAI Service token usage](azure-openai-token-limit-policy.md) | Prevents Azure OpenAI API usage spikes by limiting language model tokens per calculated key. | Yes | Yes | No | No |
+| [Limit Azure OpenAI Service token usage](azure-openai-token-limit-policy.md) | Prevents Azure OpenAI API usage spikes by limiting large language model tokens per calculated key. | Yes | Yes | No | No |
+| [Limit large language model API token usage](llm-token-limit-policy.md) | Prevents large language model (LLM) API usage spikes by limiting LLM tokens per calculated key. | Yes | Yes | No | No |
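The `llm-token-limit` reference page isn't among the files shown in this diff. As rough orientation only, a hypothetical configuration might mirror the existing `azure-openai-token-limit` schema; the attribute names and values below are assumptions, not confirmed by this change:

```xml
<policies>
    <inbound>
        <!-- Hypothetical sketch only: throttle LLM token consumption per caller IP.
             Attribute names mirror azure-openai-token-limit and may differ in the
             published llm-token-limit reference. -->
        <llm-token-limit
            counter-key="@(context.Request.IpAddress)"
            tokens-per-minute="5000"
            estimate-prompt-tokens="false"
            remaining-tokens-variable-name="remainingTokens" />
        <base />
    </inbound>
</policies>
```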

 ## Authentication and authorization

@@ -80,8 +81,10 @@ More information about policies:
 | [Get value from cache](cache-lookup-value-policy.md) | Retrieves a cached item by key. | Yes | Yes | Yes | Yes |
 | [Store value in cache](cache-store-value-policy.md) | Stores an item in the cache by key. | Yes | Yes | Yes | Yes |
 | [Remove value from cache](cache-remove-value-policy.md) | Removes an item in the cache by key. | Yes | Yes | Yes | Yes |
-| [Get cached responses of Azure OpenAI API requests](azure-openai-semantic-cache-lookup-policy.md) | Performs cache lookup using semantic search and returns a valid cached response when available. | Yes | Yes | Yes | Yes |
+| [Get cached responses of Azure OpenAI API requests](azure-openai-semantic-cache-lookup-policy.md) | Performs lookup in Azure OpenAI API cache using semantic search and returns a valid cached response when available. | Yes | Yes | Yes | Yes |
 | [Store responses of Azure OpenAI API requests to cache](azure-openai-semantic-cache-store-policy.md) | Caches response according to the Azure OpenAI API cache configuration. | Yes | Yes | Yes | Yes |
+| [Get cached responses of large language model API requests](llm-semantic-cache-lookup-policy.md) | Performs lookup in large language model API cache using semantic search and returns a valid cached response when available. | Yes | Yes | Yes | Yes |
+| [Store responses of large language model API requests to cache](llm-semantic-cache-store-policy.md) | Caches response according to the large language model API cache configuration. | Yes | Yes | Yes | Yes |

@@ -130,7 +133,8 @@ More information about policies:
 |---------|---------|---------|---------|---------|--------|
 | [Trace](trace-policy.md) | Adds custom traces into the [request tracing](./api-management-howto-api-inspector.md) output in the test console, Application Insights telemetries, and resource logs. | Yes | Yes<sup>1</sup> | Yes | Yes |
 | [Emit metrics](emit-metric-policy.md) | Sends custom metrics to Application Insights at execution. | Yes | Yes | Yes | Yes |
-| [Emit Azure OpenAI token metrics](azure-openai-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of language model tokens through Azure OpenAI service APIs. | Yes | Yes | No | No |
+| [Emit Azure OpenAI token metrics](azure-openai-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of large language model tokens through Azure OpenAI service APIs. | Yes | Yes | No | No |
+| [Emit large language model API token metrics](llm-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of large language model (LLM) tokens through LLM APIs. | Yes | Yes | No | No |

 <sup>1</sup> In the V2 gateway, the `trace` policy currently does not add tracing output in the test console.

articles/api-management/azure-openai-enable-semantic-caching.md

Lines changed: 12 additions & 3 deletions

@@ -17,6 +17,9 @@ ms.collection: ce-skilling-ai-copilot

 Enable semantic caching of responses to Azure OpenAI API requests to reduce bandwidth and processing requirements imposed on the backend APIs and lower latency perceived by API consumers. With semantic caching, you can return cached responses for identical prompts and also for prompts that are similar in meaning, even if the text isn't the same. For background, see [Tutorial: Use Azure Cache for Redis as a semantic cache](../azure-cache-for-redis/cache-tutorial-semantic-cache.md).

+> [!NOTE]
+> The configuration steps in this article enable semantic caching for Azure OpenAI APIs. These steps can be generalized to enable semantic caching for corresponding large language model (LLM) APIs available through the [Azure AI Model Inference API](../ai-studio/reference/reference-model-inference-api.md).
+
 ## Prerequisites

 * One or more Azure OpenAI Service APIs must be added to your API Management instance. For more information, see [Add an Azure OpenAI Service API to Azure API Management](azure-openai-api-from-specification.md).
@@ -48,13 +51,13 @@ with request body:

 When the request succeeds, the response includes a completion for the chat message.

-## Create a backend for Embeddings API
+## Create a backend for embeddings API

-Configure a [backend](backends.md) resource for the Embeddings API deployment with the following settings:
+Configure a [backend](backends.md) resource for the embeddings API deployment with the following settings:

 * **Name** - A name of your choice, such as `embeddings-backend`. You use this name to reference the backend in policies.
 * **Type** - Select **Custom URL**.
-* **Runtime URL** - The URL of the Embeddings API deployment in the Azure OpenAI Service, similar to:
+* **Runtime URL** - The URL of the embeddings API deployment in the Azure OpenAI Service, similar to:
 ```
 https://my-aoai.openai.azure.com/openai/deployments/embeddings-deployment/embeddings
 ```
@@ -111,6 +114,9 @@ If the request is successful, the response includes a vector representation of t
 Configure the following policies to enable semantic caching for Azure OpenAI APIs in Azure API Management:
 * In the **Inbound processing** section for the API, add the [azure-openai-semantic-cache-lookup](azure-openai-semantic-cache-lookup-policy.md) policy. In the `embeddings-backend-id` attribute, specify the Embeddings API backend you created.

+> [!NOTE]
+> When enabling semantic caching for other large language model APIs, use the [llm-semantic-cache-lookup](llm-semantic-cache-lookup-policy.md) policy instead.
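As an illustration of that substitution, a hypothetical LLM equivalent of the lookup snippet might look like the following; the values are placeholders, and the attributes follow the `llm-semantic-cache-lookup` reference added in this PR:

```xml
<llm-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned" />
```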
+
 Example:

 ```xml
@@ -125,6 +131,9 @@ Configure the following policies to enable semantic caching for Azure OpenAI API

 * In the **Outbound processing** section for the API, add the [azure-openai-semantic-cache-store](azure-openai-semantic-cache-store-policy.md) policy.

+> [!NOTE]
+> When enabling semantic caching for other large language model APIs, use the [llm-semantic-cache-store](llm-semantic-cache-store-policy.md) policy instead.
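Likewise, a hypothetical LLM equivalent of the store snippet, with an illustrative duration value:

```xml
<llm-semantic-cache-store duration="60" />
```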
+
 Example:

 ```xml

articles/api-management/azure-openai-token-limit-policy.md

Lines changed: 1 addition & 0 deletions

@@ -87,6 +87,7 @@ In the following example, the token limit of 5000 per minute is keyed by the cal
 ## Related policies

 * [Rate limiting and quotas](api-management-policies.md#rate-limiting-and-quotas)
+* [llm-token-limit](llm-token-limit-policy.md) policy
 * [azure-openai-emit-token-metric](azure-openai-emit-token-metric-policy.md) policy

 [!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]
articles/api-management/llm-emit-token-metric-policy.md

Lines changed: 116 additions & 0 deletions

@@ -0,0 +1,116 @@
---
title: Azure API Management policy reference - llm-emit-token-metric
description: Reference for the llm-emit-token-metric policy available for use in Azure API Management. Provides policy usage, settings, and examples.
services: api-management
author: dlepow

ms.service: azure-api-management
ms.topic: article
ms.date: 08/08/2024
ms.author: danlep
ms.collection: ce-skilling-ai-copilot
ms.custom:
---

# Emit metrics for consumption of large language model tokens

[!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

The `llm-emit-token-metric` policy sends metrics to Application Insights about consumption of large language model (LLM) tokens through LLM APIs. Token count metrics include: Total Tokens, Prompt Tokens, and Completion Tokens.

> [!NOTE]
> Currently, this policy is in preview.

[!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

[!INCLUDE [api-management-llm-models](../../includes/api-management-llm-models.md)]

## Prerequisites

* One or more LLM APIs must be added to your API Management instance.
* Your API Management instance must be integrated with Application Insights. For more information, see [How to integrate Azure API Management with Azure Application Insights](./api-management-howto-app-insights.md#create-a-connection-using-the-azure-portal).
* Enable Application Insights logging for your LLM APIs.
* Enable custom metrics with dimensions in Application Insights. For more information, see [Emit custom metrics](api-management-howto-app-insights.md#emit-custom-metrics).

## Policy statement

```xml
<llm-emit-token-metric namespace="metric namespace">
    <dimension name="dimension name" value="dimension value" />
    ...additional dimensions...
</llm-emit-token-metric>
```

## Attributes

| Attribute | Description | Required | Default value |
| --------- | -------------------------- | ------------------ | -------------- |
| namespace | A string. Namespace of metric. Policy expressions aren't allowed. | No | API Management |
| value | Value of metric expressed as a double. Policy expressions are allowed. | No | 1 |

## Elements

| Element | Description | Required |
| ----------- | --------------------------------------------------------------------------------- | -------- |
| dimension | Add one or more of these elements for each dimension included in the metric. | Yes |

### dimension attributes

| Attribute | Description | Required | Default value |
| --------- | -------------------------- | ------------------ | -------------- |
| name | A string or policy expression. Name of dimension. | Yes | N/A |
| value | A string or policy expression. Value of dimension. Can only be omitted if `name` matches one of the default dimensions. If so, value is provided as per dimension name. | No | N/A |

### Default dimension names that may be used without value

* API ID
* Operation ID
* Product ID
* User ID
* Subscription ID
* Location
* Gateway ID

## Usage

- [**Policy sections:**](./api-management-howto-policies.md#sections) inbound
- [**Policy scopes:**](./api-management-howto-policies.md#scopes) global, workspace, product, API, operation
- [**Gateways:**](api-management-gateways-overview.md) classic, v2, consumption, self-hosted, workspace

### Usage notes

* This policy can be used multiple times per policy definition.
* You can configure at most 10 custom dimensions for this policy.
* Where available, values in the usage section of the response from the LLM API are used to determine token metrics.
* Certain LLM endpoints support streaming of responses. When `stream` is set to `true` in the API request to enable streaming, token metrics are estimated.

## Example

The following example sends LLM token count metrics to Application Insights along with User ID, Client IP, and API ID as dimensions.

```xml
<policies>
    <inbound>
        <llm-emit-token-metric namespace="MyLLM">
            <dimension name="User ID" />
            <dimension name="Client IP" value="@(context.Request.IpAddress)" />
            <dimension name="API ID" />
        </llm-emit-token-metric>
    </inbound>
    <outbound>
    </outbound>
</policies>
```

## Related policies

* [Logging](api-management-policies.md#logging)
* [emit-metric](emit-metric-policy.md) policy
* [azure-openai-emit-token-metric](azure-openai-emit-token-metric-policy.md) policy
* [llm-token-limit](llm-token-limit-policy.md) policy

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]
articles/api-management/llm-semantic-cache-lookup-policy.md

Lines changed: 81 additions & 0 deletions

@@ -0,0 +1,81 @@
---
title: Azure API Management policy reference - llm-semantic-cache-lookup | Microsoft Docs
description: Reference for the llm-semantic-cache-lookup policy available for use in Azure API Management. Provides policy usage, settings, and examples.
services: api-management
author: dlepow

ms.service: azure-api-management
ms.collection: ce-skilling-ai-copilot
ms.custom:
  - build-2024
ms.topic: article
ms.date: 08/07/2024
ms.author: danlep
---

# Get cached responses of large language model API requests

[!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses to large language model (LLM) API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend LLM API and lowers latency perceived by API consumers.

> [!NOTE]
> * This policy must have a corresponding [Cache responses to large language model API requests](llm-semantic-cache-store-policy.md) policy.
> * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
> * Currently, this policy is in preview.

[!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

## Policy statement

```xml
<llm-semantic-cache-lookup
    score-threshold="similarity score threshold"
    embeddings-backend-id="backend entity ID for embeddings API"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true | false"
    max-message-count="count" >
    <vary-by>"expression to partition caching"</vary-by>
</llm-semantic-cache-lookup>
```

## Attributes

| Attribute | Description | Required | Default |
| ----------------- | ------------------------------------------------------ | -------- | ------- |
| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. [Learn more](../azure-cache-for-redis/cache-tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
| embeddings-backend-id | [Backend](backends.md) ID for OpenAI embeddings API call. | Yes | N/A |
| embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to `system-assigned`. | N/A |
| ignore-system-messages | Boolean. If set to `true`, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
| max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |

## Elements

|Name|Description|Required|
|----------|-----------------|--------------|
|vary-by| A custom expression determined at runtime whose value partitions caching. If multiple `vary-by` elements are added, values are concatenated to create a unique combination. | No |

## Usage

- [**Policy sections:**](./api-management-howto-policies.md#sections) inbound
- [**Policy scopes:**](./api-management-howto-policies.md#scopes) global, product, API, operation
- [**Gateways:**](api-management-gateways-overview.md) v2

### Usage notes

- This policy can only be used once in a policy section.

## Examples

### Example with corresponding llm-semantic-cache-store policy

[!INCLUDE [api-management-semantic-cache-example](../../includes/api-management-semantic-cache-example.md)]
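The committed example comes from the shared include above, which isn't expanded in this diff. For orientation, a minimal sketch of how the lookup and store policies might be paired, assuming an embeddings backend named `embeddings-backend`; all values here are illustrative rather than taken from the include:

```xml
<policies>
    <inbound>
        <base />
        <!-- Illustrative: serve a cached response when a semantically similar prompt
             was answered recently. Threshold and backend name are placeholders. -->
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned"
            ignore-system-messages="true"
            max-message-count="10">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- Illustrative: keep new responses for 60 seconds so later, similar prompts can hit the cache. -->
        <llm-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
```

The `vary-by` expression partitions cache entries by subscription so that different callers don't receive each other's cached completions.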

## Related policies

* [Caching](api-management-policies.md#caching)
* [llm-semantic-cache-store](llm-semantic-cache-store-policy.md)

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]
articles/api-management/llm-semantic-cache-store-policy.md

Lines changed: 64 additions & 0 deletions

@@ -0,0 +1,64 @@
---
title: Azure API Management policy reference - llm-semantic-cache-store
description: Reference for the llm-semantic-cache-store policy available for use in Azure API Management. Provides policy usage, settings, and examples.
services: api-management
author: dlepow

ms.service: azure-api-management
ms.collection: ce-skilling-ai-copilot
ms.custom:
ms.topic: article
ms.date: 08/08/2024
ms.author: danlep
---

# Cache responses to large language model API requests

[!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

The `llm-semantic-cache-store` policy caches responses to chat completion API and completion API requests to a configured external cache. Response caching reduces bandwidth and processing requirements imposed on the backend LLM API and lowers latency perceived by API consumers.

> [!NOTE]
> * This policy must have a corresponding [Get cached responses of large language model API requests](llm-semantic-cache-lookup-policy.md) policy.
> * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
> * Currently, this policy is in preview.

[!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

## Policy statement

```xml
<llm-semantic-cache-store duration="seconds"/>
```

## Attributes

| Attribute | Description | Required | Default |
| ----------------- | ------------------------------------------------------ | -------- | ------- |
| duration | Time-to-live of the cached entries, specified in seconds. Policy expressions are allowed. | Yes | N/A |

## Usage

- [**Policy sections:**](./api-management-howto-policies.md#sections) outbound
- [**Policy scopes:**](./api-management-howto-policies.md#scopes) global, product, API, operation
- [**Gateways:**](api-management-gateways-overview.md) v2

### Usage notes

- This policy can only be used once in a policy section.
- If the cache lookup fails, the API call that uses the cache-related operation doesn't raise an error, and the cache operation completes successfully.

## Examples

### Example with corresponding llm-semantic-cache-lookup policy

[!INCLUDE [api-management-semantic-cache-example](../../includes/api-management-semantic-cache-example.md)]
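Because policy expressions are allowed for `duration`, a hypothetical variant could compute the TTL at runtime; the variable name below is illustrative and would be set earlier in the pipeline (for example, with `set-variable` in the inbound section):

```xml
<policies>
    <outbound>
        <!-- Illustrative: read the cache TTL from a context variable, defaulting to 60 seconds. -->
        <llm-semantic-cache-store duration="@(context.Variables.GetValueOrDefault<int>("semanticCacheTtl", 60))" />
        <base />
    </outbound>
</policies>
```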

## Related policies

* [Caching](api-management-policies.md#caching)
* [llm-semantic-cache-lookup](llm-semantic-cache-lookup-policy.md)

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]
