
Commit 4fe54f4

draft complete
1 parent 05a2e1f commit 4fe54f4

8 files changed: 27 additions & 159 deletions

articles/api-management/api-management-policies.md

Lines changed: 2 additions & 2 deletions
@@ -37,7 +37,7 @@ More information about policies:
  | [Set usage quota by key](quota-by-key-policy.md) | Allows you to enforce a renewable or lifetime call volume and/or bandwidth quota, on a per key basis. | Yes | No | No | Yes |
  | [Limit concurrency](limit-concurrency-policy.md) | Prevents enclosed policies from executing by more than the specified number of requests at a time. | Yes | Yes | Yes | Yes |
  | [Limit Azure OpenAI Service token usage](azure-openai-token-limit-policy.md) | Prevents Azure OpenAI API usage spikes by limiting language model tokens per calculated key. | Yes | Yes | No | No |
- | [Limit large language model API token usage](llm-token-limit-policy.md) | Prevents large language model API usage spikes by limiting language model tokens per calculated key. | Yes | Yes | No | No |
+ | [Limit large language model API token usage](llm-token-limit-policy.md) | Prevents large language model (LLM) API usage spikes by limiting LLM tokens per calculated key. | Yes | Yes | No | No |

  ## Authentication and authorization

@@ -134,7 +134,7 @@ More information about policies:
  | [Trace](trace-policy.md) | Adds custom traces into the [request tracing](./api-management-howto-api-inspector.md) output in the test console, Application Insights telemetries, and resource logs. | Yes | Yes<sup>1</sup> | Yes | Yes |
  | [Emit metrics](emit-metric-policy.md) | Sends custom metrics to Application Insights at execution. | Yes | Yes | Yes | Yes |
  | [Emit Azure OpenAI token metrics](azure-openai-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of language model tokens through Azure OpenAI service APIs. | Yes | Yes | No | No |
- | [Emit large language model API token metrics](llm-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of language model tokens through large language model APIs. | Yes | Yes | No | No |
+ | [Emit large language model API token metrics](llm-emit-token-metric-policy.md) | Sends metrics to Application Insights for consumption of large language model (LLM) tokens through LLM APIs. | Yes | Yes | No | No |


  <sup>1</sup> In the V2 gateway, the `trace` policy currently does not add tracing output in the test console.
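For context, a minimal sketch of how the Azure OpenAI token policies listed in these tables might be combined in an API's inbound section. The element and attribute names follow the linked policy references; the limit value, namespace, and dimension are illustrative placeholders.

```xml
<inbound>
    <base />
    <!-- Cap prompt + completion tokens per subscription (illustrative limit) -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="5000"
        estimate-prompt-tokens="false"
        remaining-tokens-header-name="remaining-tokens" />
    <!-- Emit token consumption metrics to Application Insights -->
    <azure-openai-emit-token-metric namespace="openai-metrics">
        <dimension name="Subscription ID" />
    </azure-openai-emit-token-metric>
</inbound>
```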

articles/api-management/azure-openai-enable-semantic-caching.md

Lines changed: 12 additions & 3 deletions
@@ -17,6 +17,9 @@ ms.collection: ce-skilling-ai-copilot

  Enable semantic caching of responses to Azure OpenAI API requests to reduce bandwidth and processing requirements imposed on the backend APIs and lower latency perceived by API consumers. With semantic caching, you can return cached responses for identical prompts and also for prompts that are similar in meaning, even if the text isn't the same. For background, see [Tutorial: Use Azure Cache for Redis as a semantic cache](../azure-cache-for-redis/cache-tutorial-semantic-cache.md).

+ > [!NOTE]
+ > The configuration steps in this article enable semantic caching for Azure OpenAI APIs. These steps can be generalized to enable semantic caching for corresponding large language model (LLM) APIs available through the [Azure AI Model Inference API](../ai-studio/reference/reference-model-inference-api.md).
+
  ## Prerequisites

  * One or more Azure OpenAI Service APIs must be added to your API Management instance. For more information, see [Add an Azure OpenAI Service API to Azure API Management](azure-openai-api-from-specification.md).

@@ -48,13 +51,13 @@ with request body:

  When the request succeeds, the response includes a completion for the chat message.

- ## Create a backend for Embeddings API
+ ## Create a backend for embeddings API

- Configure a [backend](backends.md) resource for the Embeddings API deployment with the following settings:
+ Configure a [backend](backends.md) resource for the embeddings API deployment with the following settings:

  * **Name** - A name of your choice, such as `embeddings-backend`. You use this name to reference the backend in policies.
  * **Type** - Select **Custom URL**.
- * **Runtime URL** - The URL of the Embeddings API deployment in the Azure OpenAI Service, similar to:
+ * **Runtime URL** - The URL of the embeddings API deployment in the Azure OpenAI Service, similar to:
  ```
  https://my-aoai.openai.azure.com/openai/deployments/embeddings-deployment/embeddings
  ```

@@ -111,6 +114,9 @@ If the request is successful, the response includes a vector representation of t
  Configure the following policies to enable semantic caching for Azure OpenAI APIs in Azure API Management:
  * In the **Inbound processing** section for the API, add the [azure-openai-semantic-cache-lookup](azure-openai-semantic-cache-lookup-policy.md) policy. In the `embeddings-backend-id` attribute, specify the Embeddings API backend you created.

+ > [!NOTE]
+ > When enabling semantic caching for other large language model APIs, use the [llm-semantic-cache-lookup](llm-semantic-cache-lookup-policy.md) policy instead.
+
  Example:

  ```xml

@@ -125,6 +131,9 @@ Configure the following policies to enable semantic caching for Azure OpenAI API

  * In the **Outbound processing** section for the API, add the [azure-openai-semantic-cache-store](azure-openai-semantic-cache-store-policy.md) policy.

+ > [!NOTE]
+ > When enabling semantic caching for other large language model APIs, use the [llm-semantic-cache-store](llm-semantic-cache-store-policy.md) policy instead.
+
  Example:

  ```xml
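For orientation, a sketch of how the inbound lookup and outbound store policies described above fit together in one policy definition. The `embeddings-backend-id` value refers to the `embeddings-backend` resource created earlier; the similarity threshold, authentication mode, and cache duration are illustrative placeholders.

```xml
<policies>
    <inbound>
        <base />
        <!-- Return a cached response when a semantically similar prompt was answered before -->
        <azure-openai-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned" />
    </inbound>
    <outbound>
        <base />
        <!-- Store the response for future semantic lookups; duration is in seconds -->
        <azure-openai-semantic-cache-store duration="60" />
    </outbound>
</policies>
```

For other LLM APIs, the `llm-semantic-cache-lookup` and `llm-semantic-cache-store` policies take the place of the `azure-openai-*` elements, as the notes above call out.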

articles/api-management/llm-emit-token-metric-policy.md

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@ author: dlepow

  ms.service: azure-api-management
  ms.topic: article
- ms.date: 08/07/2024
+ ms.date: 08/08/2024
  ms.author: danlep
  ms.collection: ce-skilling-ai-copilot
  ms.custom:

@@ -16,7 +16,7 @@ ms.custom:

  [!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

- The `llm-emit-token-metric` policy sends metrics to Application Insights about consumption of large language model tokens through LLM APIs. Token count metrics include: Total Tokens, Prompt Tokens, and Completion Tokens.
+ The `llm-emit-token-metric` policy sends metrics to Application Insights about consumption of large language model (LLM) tokens through LLM APIs. Token count metrics include: Total Tokens, Prompt Tokens, and Completion Tokens.

  [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
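A minimal sketch of how this policy might be configured, assuming it takes the same shape as its Azure OpenAI counterpart (an optional namespace plus `dimension` elements); the namespace and dimensions shown are placeholders.

```xml
<inbound>
    <base />
    <!-- Send Total, Prompt, and Completion token counts to Application Insights -->
    <llm-emit-token-metric namespace="llm-metrics">
        <!-- Each dimension becomes a filterable property on the emitted metrics -->
        <dimension name="API ID" />
        <dimension name="Subscription ID" />
    </llm-emit-token-metric>
</inbound>
```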

articles/api-management/llm-enable-semantic-caching.md

Lines changed: 0 additions & 142 deletions
This file was deleted.

articles/api-management/llm-semantic-cache-lookup-policy.md

Lines changed: 4 additions & 4 deletions
@@ -13,15 +13,15 @@ ms.date: 08/07/2024
  ms.author: danlep
  ---

- # Get cached responses of Azure OpenAI API requests
+ # Get cached responses of large language model API requests

  [!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

- Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses to Azure OpenAI Chat Completion API and Completion API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend Azure OpenAI API and lowers latency perceived by API consumers.
+ Use the `llm-semantic-cache-lookup` policy to perform cache lookup of responses to large language model (LLM) API requests from a configured external cache, based on vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend LLM API and lowers latency perceived by API consumers.

  > [!NOTE]
- > * This policy must have a corresponding [Cache responses to Azure OpenAI API requests](llm-semantic-cache-store-policy.md) policy.
- > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](llm-enable-semantic-caching.md).
+ > * This policy must have a corresponding [Cache responses to large language model API requests](llm-semantic-cache-store-policy.md) policy.
+ > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
  > * Currently, this policy is in preview.

  [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
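A sketch of the lookup policy in an API's inbound section, assuming it mirrors the attributes of the `azure-openai-semantic-cache-lookup` policy; the score threshold, backend name, and authentication mode are placeholders.

```xml
<inbound>
    <base />
    <!-- Serve a cached completion when the prompt is semantically close to an earlier one -->
    <llm-semantic-cache-lookup
        score-threshold="0.05"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned" />
</inbound>
```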

articles/api-management/llm-semantic-cache-store-policy.md

Lines changed: 3 additions & 3 deletions
@@ -8,19 +8,19 @@ ms.service: azure-api-management
  ms.collection: ce-skilling-ai-copilot
  ms.custom:
  ms.topic: article
- ms.date: 08/07/2024
+ ms.date: 08/08/2024
  ms.author: danlep
  ---

  # Cache responses to large language model API requests

  [!INCLUDE [api-management-availability-all-tiers](../../includes/api-management-availability-all-tiers.md)]

- The `llm-semantic-cache-store` policy caches responses to [TBD - add here] requests to a configured external cache. Response caching reduces bandwidth and processing requirements imposed on the backend Azure OpenAI API and lowers latency perceived by API consumers.
+ The `llm-semantic-cache-store` policy caches responses to chat completion API and completion API requests to a configured external cache. Response caching reduces bandwidth and processing requirements imposed on the backend Azure OpenAI API and lowers latency perceived by API consumers.

  > [!NOTE]
  > * This policy must have a corresponding [Get cached responses to large language model API requests](llm-semantic-cache-lookup-policy.md) policy.
- > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for large language model APIs in Azure API Management](llm-enable-semantic-caching.md).
+ > * For prerequisites and steps to enable semantic caching, see [Enable semantic caching for Azure OpenAI APIs in Azure API Management](azure-openai-enable-semantic-caching.md).
  > * Currently, this policy is in preview.

  [!INCLUDE [api-management-policy-generic-alert](../../includes/api-management-policy-generic-alert.md)]
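A sketch of the companion store policy in an API's outbound section, assuming a `duration` attribute in seconds as in the `azure-openai-semantic-cache-store` policy; the value is a placeholder.

```xml
<outbound>
    <base />
    <!-- Cache the completion so the lookup policy can serve it for similar prompts -->
    <llm-semantic-cache-store duration="120" />
</outbound>
```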

articles/api-management/llm-token-limit-policy.md

Lines changed: 3 additions & 2 deletions
@@ -8,7 +8,7 @@ ms.service: azure-api-management
  ms.collection: ce-skilling-ai-copilot
  ms.custom:
  ms.topic: article
- ms.date: 08/07/2024
+ ms.date: 08/08/2024
  ms.author: danlep
  ---

@@ -43,7 +43,7 @@ By relying on token usage metrics returned from the LLM endpoint, the policy can
  | -------------- | ----------------------------------------------------------------------------------------------------- | -------- | ------- |
  | counter-key | The key to use for the token limit policy. For each key value, a single counter is used for all scopes at which the policy is configured. Policy expressions are allowed.| Yes | N/A |
  | tokens-per-minute | The maximum number of tokens consumed by prompt and completion per minute. | Yes | N/A |
- | estimate-prompt-tokens | Boolean value that determines whether to estimate the number of tokens required for a prompt: <br> - `true`: estimate the number of tokens based on prompt schema in API; may reduce performance. <br> - `false`: don't estimate prompt tokens. | Yes | N/A |
+ | estimate-prompt-tokens | Boolean value that determines whether to estimate the number of tokens required for a prompt: <br> - `true`: estimate the number of tokens based on the prompt schema in the API; may reduce performance. <br> - `false`: don't estimate prompt tokens. <br><br>When set to `false`, the remaining tokens per `counter-key` are calculated using the actual token usage from the model's response. This could result in prompts that exceed the token limit being sent to the model. In that case, the overage is detected in the response, and all subsequent requests are blocked by the policy until the token limit frees up again. | Yes | N/A |
  | retry-after-header-name | The name of a custom response header whose value is the recommended retry interval in seconds after the specified `tokens-per-minute` is exceeded. Policy expressions aren't allowed. | No | `Retry-After` |
  | retry-after-variable-name | The name of a variable that stores the recommended retry interval in seconds after the specified `tokens-per-minute` is exceeded. Policy expressions aren't allowed. | No | N/A |
  | remaining-tokens-header-name | The name of a response header whose value after each policy execution is the number of remaining tokens allowed for the time interval. Policy expressions aren't allowed.| No | N/A |
@@ -59,6 +59,7 @@ By relying on token usage metrics returned from the LLM endpoint, the policy can

  ### Usage notes

+ *
  * This policy can be used multiple times per policy definition.
  * Where available when `estimate-prompt-tokens` is set to `false`, values in the usage section of the response from the LLM API are used to determine token usage.
  * Certain LLM endpoints support streaming of responses. When `stream` is set to `true` in the API request to enable streaming, prompt tokens are always estimated, regardless of the value of the `estimate-prompt-tokens` attribute.
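Drawing on the attributes in the table above, a sketch of how the policy might be applied per subscription; the numeric limit and header names are illustrative placeholders.

```xml
<inbound>
    <base />
    <!-- Allow up to an illustrative 5,000 prompt + completion tokens per minute per subscription -->
    <llm-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="5000"
        estimate-prompt-tokens="false"
        remaining-tokens-header-name="remaining-tokens"
        retry-after-header-name="retry-after" />
</inbound>
```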

includes/api-management-llm-models.md

Lines changed: 1 addition & 1 deletion
@@ -10,4 +10,4 @@ ms.author: danlep

  ## Supported models

- The policy is used with LLMs added to Azure API Management that are exposed through the [Azure AI Model Inference API](../articles/ai-studio/reference/reference-model-inference-api.md).
+ Use the policy with LLM APIs added to Azure API Management that are available through the [Azure AI Model Inference API](../articles/ai-studio/reference/reference-model-inference-api.md).
