
Commit e29b22a

[APIM] LLM API policies
1 parent 3f663b1 commit e29b22a

8 files changed: +509 -0 lines changed

articles/api-management/api-management-policies.md

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,7 @@ More information about policies:
| [Set usage quota by key](quota-by-key-policy.md) | Allows you to enforce a renewable or lifetime call volume and/or bandwidth quota, on a per key basis. | Yes | No | No | Yes |
| [Limit concurrency](limit-concurrency-policy.md) | Prevents enclosed policies from executing by more than the specified number of requests at a time. | Yes | Yes | Yes | Yes |
| [Limit Azure OpenAI Service token usage](azure-openai-token-limit-policy.md) | Prevents Azure OpenAI API usage spikes by limiting language model tokens per calculated key. | Yes | Yes | No | No |
| [Limit large language model API token usage](llm-token-limit-policy.md) | Prevents large language model API usage spikes by limiting language model tokens per calculated key. | Yes | Yes | No | No |

## Authentication and authorization

@@ -82,6 +83,8 @@ More information about policies:
| [Remove value from cache](cache-remove-value-policy.md) | Removes an item in the cache by key. | Yes | Yes | Yes | Yes |
| [Get cached responses of Azure OpenAI API requests](azure-openai-semantic-cache-lookup-policy.md) | Performs cache lookup using semantic search and returns a valid cached response when available. | Yes | Yes | Yes | Yes |
| [Store responses of Azure OpenAI API requests to cache](azure-openai-semantic-cache-store-policy.md) | Caches response according to the Azure OpenAI API cache configuration. | Yes | Yes | Yes | Yes |
| [Get cached responses of large language model API requests](llm-semantic-cache-lookup-policy.md) | Performs cache lookup using semantic search and returns a valid cached response when available. | Yes | Yes | Yes | Yes |
| [Store responses of large language model API requests to cache](llm-semantic-cache-store-policy.md) | Caches response according to the large language model API cache configuration. | Yes | Yes | Yes | Yes |

@@ -131,6 +134,7 @@ More information about policies:
| [Trace](trace-policy.md) | Adds custom traces into the [request tracing](./api-management-howto-api-inspector.md) output in the test console, Application Insights telemetries, and resource logs. | Yes | Yes<sup>1</sup> | Yes | Yes |
| [Emit metrics](emit-metric-policy.md) | Sends custom metrics to Application Insights at execution. | Yes | Yes | Yes | Yes |
| [Emit Azure OpenAI token metrics](azure-openai-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of language model tokens through Azure OpenAI service APIs. | Yes | Yes | No | No |
| [Emit large language model API token metrics](llm-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of language model tokens through large language model APIs. | Yes | Yes | No | No |

<sup>1</sup> In the V2 gateway, the `trace` policy currently does not add tracing output in the test console.

articles/api-management/azure-openai-token-limit-policy.md

Lines changed: 1 addition & 0 deletions
@@ -87,6 +87,7 @@ In the following example, the token limit of 5000 per minute is keyed by the cal
## Related policies

* [Rate limiting and quotas](api-management-policies.md#rate-limiting-and-quotas)
* [llm-token-limit](llm-token-limit-policy.md) policy
* [azure-openai-emit-token-metric](azure-openai-emit-token-metric-policy.md) policy

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]

articles/api-management/llm-emit-token-metric-policy.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
---
title: Azure API Management policy reference - llm-emit-token-metric
description: Reference for the llm-emit-token-metric policy available for use in Azure API Management. Provides policy usage, settings, and examples.
services: api-management
author: dlepow

ms.service: azure-api-management
ms.topic: article
ms.date: 08/07/2024
ms.author: danlep
ms.collection: ce-skilling-ai-copilot
ms.custom:
---

# Emit metrics for consumption of large language model tokens

[!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

The `llm-emit-token-metric` policy sends metrics to Application Insights about consumption of large language model tokens through LLM APIs. Token count metrics include: Total Tokens, Prompt Tokens, and Completion Tokens.

[!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

[!INCLUDE [api-management-llm-models](../../includes/api-management-llm-models.md)]

## Prerequisites

* One or more LLM APIs must be added to your API Management instance.
* Your API Management instance must be integrated with Application Insights. For more information, see [How to integrate Azure API Management with Azure Application Insights](./api-management-howto-app-insights.md#create-a-connection-using-the-azure-portal).
* Enable Application Insights logging for your LLM APIs.
* Enable custom metrics with dimensions in Application Insights. For more information, see [Emit custom metrics](api-management-howto-app-insights.md#emit-custom-metrics).

## Policy statement

```xml
<llm-emit-token-metric
        namespace="metric namespace" >
    <dimension name="dimension name" value="dimension value" />
    ...additional dimensions...
</llm-emit-token-metric>
```

## Attributes

| Attribute | Description | Required | Default value |
| --------- | -------------------------- | ------------------ | -------------- |
| namespace | A string. Namespace of metric. Policy expressions aren't allowed. | No | API Management |
| value | Value of metric expressed as a double. Policy expressions are allowed. | No | 1 |

## Elements

| Element | Description | Required |
| ----------- | --------------------------------------------------------------------------------- | -------- |
| dimension | Add one or more of these elements for each dimension included in the metric. | Yes |

### dimension attributes

| Attribute | Description | Required | Default value |
| --------- | -------------------------- | ------------------ | -------------- |
| name | A string or policy expression. Name of dimension. | Yes | N/A |
| value | A string or policy expression. Value of dimension. Can only be omitted if `name` matches one of the default dimensions. If so, the value is provided based on the dimension name. | No | N/A |

### Default dimension names that may be used without value

* API ID
* Operation ID
* Product ID
* User ID
* Subscription ID
* Location
* Gateway ID
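
For instance, a minimal configuration can omit the `namespace` attribute and dimension values entirely and rely on the defaults described above. The following is an illustrative sketch, not a prescribed configuration:

```xml
<llm-emit-token-metric>
    <!-- Values for default dimensions are filled in automatically; the metric namespace defaults to "API Management" -->
    <dimension name="API ID" />
    <dimension name="Product ID" />
    <dimension name="Subscription ID" />
</llm-emit-token-metric>
```
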
## Usage

- [**Policy sections:**](./api-management-howto-policies.md#sections) inbound
- [**Policy scopes:**](./api-management-howto-policies.md#scopes) global, workspace, product, API, operation
- [**Gateways:**](api-management-gateways-overview.md) classic, v2, consumption, self-hosted, workspace

### Usage notes

* This policy can be used multiple times per policy definition.
* You can configure at most 10 custom dimensions for this policy.
* Where available, values in the usage section of the response from the LLM API are used to determine token metrics.
* Certain LLM endpoints support streaming of responses. When `stream` is set to `true` in the API request to enable streaming, token metrics are estimated.

## Example

The following example sends LLM token count metrics to Application Insights along with User ID, Client IP, and API ID as dimensions.

```xml
<policies>
  <inbound>
      <llm-emit-token-metric
            namespace="MyLLM">
          <dimension name="User ID" />
          <dimension name="Client IP" value="@(context.Request.IpAddress)" />
          <dimension name="API ID" />
      </llm-emit-token-metric>
  </inbound>
  <outbound>
  </outbound>
</policies>
```

## Related policies

* [Logging](api-management-policies.md#logging)
* [emit-metric](emit-metric-policy.md) policy
* [azure-openai-emit-token-metric](azure-openai-emit-token-metric-policy.md) policy
* [llm-token-limit](llm-token-limit-policy.md) policy

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]

articles/api-management/llm-enable-semantic-caching.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
---
title: Enable semantic caching for LLM APIs in Azure API Management
description: Prerequisites and configuration steps to enable semantic caching for large language model APIs in Azure API Management.
author: dlepow
ms.service: azure-api-management
ms.custom:
ms.topic: how-to
ms.date: 08/07/2024
ms.author: danlep
ms.collection: ce-skilling-ai-copilot
---

# Enable semantic caching for large language model (LLM) APIs in Azure API Management

[!INCLUDE [api-management-availability-basicv2-standardv2](../../includes/api-management-availability-basicv2-standardv2.md)]

Enable semantic caching of responses to large language model (LLM) API requests to reduce bandwidth and processing requirements imposed on the backend APIs and lower latency perceived by API consumers. With semantic caching, you can return cached responses for identical prompts and also for prompts that are similar in meaning, even if the text isn't the same. For background, see [Tutorial: Use Azure Cache for Redis as a semantic cache](../azure-cache-for-redis/cache-tutorial-semantic-cache.md).

## Prerequisites

* One or more LLM APIs must be added to your API Management instance. For more information, see [TBD...].
* Prerequisites for the Azure AI Model Inference API [TBD...].
* The API Management instance must be configured to use managed identity authentication to the LLM APIs. For more information, see [Authenticate and authorize access to Azure OpenAI APIs using Azure API Management](api-management-authenticate-authorize-azure-openai.md#authenticate-with-managed-identity).
* [Azure Cache for Redis Enterprise](../azure-cache-for-redis/quickstart-create-redis-enterprise.md). The **RediSearch** module must be enabled on the Redis Enterprise cache.
    > [!NOTE]
    > You can only enable the **RediSearch** module when creating a new Redis Enterprise cache. You can't add a module to an existing cache. [Learn more](../azure-cache-for-redis/cache-redis-modules.md)
* External cache configured in the Azure API Management instance. For steps, see [Use an external Azure Cache for Redis in Azure API Management](api-management-howto-cache-external.md).

<!-- The following steps are for AOAI. Revise for Azure AI Model Inference API -->

## Test Chat API deployment

First, test the Azure OpenAI deployment to ensure that the Chat Completion API or Chat API is working as expected. For steps, see [Import an Azure OpenAI API to Azure API Management](llm-api-from-specification.md#test-the-llm-api).

For example, test the Azure OpenAI Chat API by sending a POST request to the API endpoint with a prompt in the request body. The response should include the completion of the prompt. Example request:

```rest
POST https://my-api-management.azure-api.net/my-api/openai/deployments/chat-deployment/chat/completions?api-version=2024-02-01
```

with request body:

```json
{"messages":[{"role":"user","content":"Hello"}]}
```

When the request succeeds, the response includes a completion for the chat message.

## Create a backend for Embeddings API

Configure a [backend](backends.md) resource for the Embeddings API deployment with the following settings:

* **Name** - A name of your choice, such as `embeddings-backend`. You use this name to reference the backend in policies.
* **Type** - Select **Custom URL**.
* **Runtime URL** - The URL of the Embeddings API deployment in the Azure OpenAI Service, similar to:

    ```
    https://my-aoai.openai.azure.com/openai/deployments/embeddings-deployment/embeddings
    ```

### Test backend

To test the backend, create an API operation for your Azure OpenAI Service API:

1. On the **Design** tab of your API, select **+ Add operation**.
1. Enter a **Display name** and optionally a **Name** for the operation.
1. In the **Frontend** section, in **URL**, select **POST** and enter the path `/`.
1. On the **Headers** tab, add a required header with the name `Content-Type` and value `application/json`.
1. Select **Save**.

Configure the following policies in the **Inbound processing** section of the API operation. In the [set-backend-service](set-backend-service-policy.md) policy, substitute the name of the backend you created.

```xml
<policies>
    <inbound>
        <set-backend-service backend-id="embeddings-backend" />
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        [...]
    </inbound>
    [...]
</policies>
```

On the **Test** tab, test the operation by adding an `api-version` query parameter with a value such as `2024-02-01`. Provide a valid request body. For example:

```json
{"input":"Hello"}
```

If the request is successful, the response includes a vector representation of the input text:

```json
{
    "object": "list",
    "data": [{
        "object": "embedding",
        "index": 0,
        "embedding": [
            -0.021829502,
            -0.007157768,
            -0.028619017,
            [...]
        ]
    }]
}
```

## Configure semantic caching policies

Configure the following policies to enable semantic caching for LLM APIs in Azure API Management (a combined sketch of both policies appears after these steps):

* In the **Inbound processing** section for the API, add the [llm-semantic-cache-lookup](llm-semantic-cache-lookup-policy.md) policy. In the `embeddings-backend-id` attribute, specify the Embeddings API backend you created.

    Example:

    ```xml
    <llm-semantic-cache-lookup
        score-threshold="0.8"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned"
        ignore-system-messages="true"
        max-message-count="10">
        <vary-by>@(context.Subscription.Id)</vary-by>
    </llm-semantic-cache-lookup>
    ```

* In the **Outbound processing** section for the API, add the [llm-semantic-cache-store](llm-semantic-cache-store-policy.md) policy.

    Example:

    ```xml
    <llm-semantic-cache-store duration="60" />
    ```
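
Taken together, the two policies might look like the following in a complete policy definition. This is a minimal sketch that assumes the `embeddings-backend` backend created earlier and reuses the sample values from the preceding steps; adjust the score threshold, cache duration, and `vary-by` expression for your scenario.

```xml
<policies>
    <inbound>
        <base />
        <!-- Return a cached response when a semantically similar prompt scores above the threshold -->
        <llm-semantic-cache-lookup
            score-threshold="0.8"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned"
            ignore-system-messages="true"
            max-message-count="10">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- Store the backend response in the semantic cache for 60 seconds -->
        <llm-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
```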

## Confirm caching

To confirm that semantic caching is working as expected, trace a test Completion or Chat Completion operation using the test console in the portal. Confirm that the cache was used on subsequent tries by inspecting the trace. [Learn more about tracing API calls in Azure API Management](api-management-howto-api-inspector.md).

For example, if the cache was used, the **Output** section includes entries similar to ones in the following screenshot:

:::image type="content" source="media/llm-enable-semantic-caching/cache-lookup.png" alt-text="Screenshot of request trace in the Azure portal.":::

## Related content

* [Caching policies](api-management-policies.md#caching)
* [Azure Cache for Redis](../azure-cache-for-redis/cache-overview.md)

articles/api-management/llm-semantic-cache-lookup-policy.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
---
title: Azure API Management policy reference - llm-semantic-cache-lookup | Microsoft Docs
description: Reference for the llm-semantic-cache-lookup policy available for use in Azure API Management. Provides policy usage, settings, and examples.
services: api-management
author: dlepow

ms.service: azure-api-management
ms.collection: ce-skilling-ai-copilot
ms.custom:
  - build-2024
ms.topic: article
ms.date: 08/07/2024
ms.author: danlep
---

# Get cached responses of large language model API requests

[!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses to large language model (LLM) Chat Completion API and Completion API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend LLM API and lowers latency perceived by API consumers.

> [!NOTE]
> * This policy must have a corresponding [Store responses of large language model API requests to cache](llm-semantic-cache-store-policy.md) policy.
> * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for LLM APIs in Azure API Management](llm-enable-semantic-caching.md).
> * Currently, this policy is in preview.

[!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]

## Policy statement

```xml
<llm-semantic-cache-lookup
    score-threshold="similarity score threshold"
    embeddings-backend-id="backend entity ID for embeddings API"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true | false"
    max-message-count="count" >
    <vary-by>"expression to partition caching"</vary-by>
</llm-semantic-cache-lookup>
```

## Attributes

| Attribute | Description | Required | Default |
| ----------------- | ------------------------------------------------------ | -------- | ------- |
| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. [Learn more](../azure-cache-for-redis/cache-tutorial-semantic-cache.md#change-the-similarity-threshold). | Yes | N/A |
| embeddings-backend-id | [Backend](backends.md) ID for OpenAI embeddings API call. | Yes | N/A |
| embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to `system-assigned`. | N/A |
| ignore-system-messages | Boolean. If set to `true`, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
| max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |

## Elements

|Name|Description|Required|
|----------|-----------------|--------------|
|vary-by| A custom expression determined at runtime whose value partitions caching. If multiple `vary-by` elements are added, values are concatenated to create a unique combination. | No |
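
For example, the following sketch (illustrative values only, assuming a backend named `embeddings-backend`) adds two `vary-by` elements so that cached responses are partitioned by the combination of subscription ID and caller IP address:

```xml
<llm-semantic-cache-lookup
    score-threshold="0.8"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned">
    <!-- Values of the vary-by expressions are concatenated to form the cache partition -->
    <vary-by>@(context.Subscription.Id)</vary-by>
    <vary-by>@(context.Request.IpAddress)</vary-by>
</llm-semantic-cache-lookup>
```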

## Usage

- [**Policy sections:**](./api-management-howto-policies.md#sections) inbound
- [**Policy scopes:**](./api-management-howto-policies.md#scopes) global, product, API, operation
- [**Gateways:**](api-management-gateways-overview.md) v2

### Usage notes

- This policy can only be used once in a policy section.

## Examples

### Example with corresponding llm-semantic-cache-store policy

[!INCLUDE [api-management-semantic-cache-example](../../includes/api-management-semantic-cache-example.md)]

## Related policies

* [Caching](api-management-policies.md#caching)
* [llm-semantic-cache-store](llm-semantic-cache-store-policy.md)

[!INCLUDE [api-management-policy-ref-next-steps](../../includes/api-management-policy-ref-next-steps.md)]
