---
title: Enable semantic caching for LLM APIs in Azure API Management
description: Prerequisites and configuration steps to enable semantic caching for large language model APIs in Azure API Management.
author: dlepow
ms.service: azure-api-management
ms.custom:
ms.topic: how-to
ms.date: 08/07/2024
ms.author: danlep
ms.collection: ce-skilling-ai-copilot
---

# Enable semantic caching for large language model (LLM) APIs in Azure API Management

[!INCLUDE [api-management-availability-basicv2-standardv2](../../includes/api-management-availability-basicv2-standardv2.md)]

Enable semantic caching of responses to large language model (LLM) API requests to reduce bandwidth and processing requirements imposed on the backend APIs and lower latency perceived by API consumers. With semantic caching, you can return cached responses for identical prompts and also for prompts that are similar in meaning, even if the text isn't the same. For background, see [Tutorial: Use Azure Cache for Redis as a semantic cache](../azure-cache-for-redis/cache-tutorial-semantic-cache.md).

## Prerequisites

* One or more LLM APIs must be added to your API Management instance. For more information, see [TBD...].
* Prerequisites for the Azure AI Model Inference API [TBD...].
* The API Management instance must be configured to use managed identity authentication to the LLM APIs. For more information, see [Authenticate and authorize access to Azure OpenAI APIs using Azure API Management](api-management-authenticate-authorize-azure-openai.md#authenticate-with-managed-identity).
* [Azure Cache for Redis Enterprise](../azure-cache-for-redis/quickstart-create-redis-enterprise.md). The **RediSearch** module must be enabled on the Redis Enterprise cache.
  > [!NOTE]
  > You can only enable the **RediSearch** module when creating a new Redis Enterprise cache. You can't add a module to an existing cache. [Learn more](../azure-cache-for-redis/cache-redis-modules.md)
* External cache configured in the Azure API Management instance. For steps, see [Use an external Azure Cache for Redis in Azure API Management](api-management-howto-cache-external.md).

| 29 | +<!-- The following steps are for AOAI. Revise for Azure AI Model Inference API --> |
| 30 | +## Test Chat API deployment |
| 31 | + |
| 32 | +First, test the Azure OpenAI deployment to ensure that the Chat Completion API or Chat API is working as expected. For steps, see [Import an Azure OpenAI API to Azure API Management](llm-api-from-specification.md#test-the-llm-api). |
| 33 | + |
| 34 | +For example, test the Azure OpenAI Chat API by sending a POST request to the API endpoint with a prompt in the request body. The response should include the completion of the prompt. Example request: |
| 35 | + |
```rest
POST https://my-api-management.azure-api.net/my-api/openai/deployments/chat-deployment/chat/completions?api-version=2024-02-01
```

with request body:

```json
{"messages":[{"role":"user","content":"Hello"}]}
```

When the request succeeds, the response includes a completion for the chat message.
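
A trimmed sketch of what such a response might look like (the model name, identifiers, and token counts here are illustrative, not output from a real deployment):

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 9,
    "total_tokens": 17
  }
}
```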

## Create a backend for Embeddings API

Configure a [backend](backends.md) resource for the Embeddings API deployment with the following settings (see the sketch after this list for a scripted equivalent):

* **Name** - A name of your choice, such as `embeddings-backend`. You use this name to reference the backend in policies.
* **Type** - Select **Custom URL**.
* **Runtime URL** - The URL of the Embeddings API deployment in the Azure OpenAI Service, similar to:
    ```
    https://my-aoai.openai.azure.com/openai/deployments/embeddings-deployment/embeddings
    ```
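
If you prefer to script this step instead of using the portal, one option is the Azure Resource Manager REST API for API Management backends. This is a sketch only: the subscription, resource group, and service names are placeholders, and you should confirm the `api-version` available in your environment.

```rest
PUT https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group}/providers/Microsoft.ApiManagement/service/{apim-service-name}/backends/embeddings-backend?api-version=2022-08-01
```

with request body:

```json
{
  "properties": {
    "description": "Embeddings API deployment in Azure OpenAI",
    "url": "https://my-aoai.openai.azure.com/openai/deployments/embeddings-deployment/embeddings",
    "protocol": "http"
  }
}
```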

### Test backend

To test the backend, create an API operation for your Azure OpenAI Service API:

1. On the **Design** tab of your API, select **+ Add operation**.
1. Enter a **Display name** and optionally a **Name** for the operation.
1. In the **Frontend** section, in **URL**, select **POST** and enter the path `/`.
1. On the **Headers** tab, add a required header with the name `Content-Type` and value `application/json`.
1. Select **Save**.

Configure the following policies in the **Inbound processing** section of the API operation. In the [set-backend-service](set-backend-service-policy.md) policy, substitute the name of the backend you created.

```xml
<policies>
    <inbound>
        <set-backend-service backend-id="embeddings-backend" />
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        [...]
    </inbound>
    [...]
</policies>
```

On the **Test** tab, test the operation by adding an `api-version` query parameter with a value such as `2024-02-01`. Provide a valid request body. For example:

```json
{"input":"Hello"}
```
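
The resulting test request, reusing the hypothetical gateway name and API path from the earlier example, would look something like the following (the exact URL depends on the path configured for your API):

```rest
POST https://my-api-management.azure-api.net/my-api/?api-version=2024-02-01
```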

If the request is successful, the response includes a vector representation of the input text:

```json
{
  "object": "list",
  "data": [{
    "object": "embedding",
    "index": 0,
    "embedding": [
      -0.021829502,
      -0.007157768,
      -0.028619017,
      [...]
    ]
  }]
}
```

## Configure semantic caching policies

Configure the following policies to enable semantic caching for Azure OpenAI APIs in Azure API Management (a combined example appears after these steps):

* In the **Inbound processing** section for the API, add the [llm-semantic-cache-lookup](llm-semantic-cache-lookup-policy.md) policy. In the `embeddings-backend-id` attribute, specify the Embeddings API backend you created.

    Example:

    ```xml
    <llm-semantic-cache-lookup
        score-threshold="0.8"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned"
        ignore-system-messages="true"
        max-message-count="10">
        <vary-by>@(context.Subscription.Id)</vary-by>
    </llm-semantic-cache-lookup>
    ```

* In the **Outbound processing** section for the API, add the [llm-semantic-cache-store](llm-semantic-cache-store-policy.md) policy.

    Example:

    ```xml
    <llm-semantic-cache-store duration="60" />
    ```
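
Put together, the two policies might sit in the API's policy definition roughly as follows. This is a minimal sketch: the attribute values repeat the examples above, and the placement of `<base />` and any other policies depends on what your API already configures.

```xml
<policies>
    <inbound>
        <base />
        <llm-semantic-cache-lookup
            score-threshold="0.8"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <llm-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>
```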

## Confirm caching

To confirm that semantic caching is working as expected, trace a test Completion or Chat Completion operation using the test console in the portal. Confirm that the cache was used on subsequent tries by inspecting the trace. [Learn more about tracing API calls in Azure API Management](api-management-howto-api-inspector.md).
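
One way to exercise the cache is to send two requests whose prompts are worded differently but close in meaning, then inspect the trace of the second call for a cache hit. The endpoint and prompts below are illustrative:

```rest
POST https://my-api-management.azure-api.net/my-api/openai/deployments/chat-deployment/chat/completions?api-version=2024-02-01

{"messages":[{"role":"user","content":"How do I reset my password?"}]}

###

POST https://my-api-management.azure-api.net/my-api/openai/deployments/chat-deployment/chat/completions?api-version=2024-02-01

{"messages":[{"role":"user","content":"What are the steps to change my password?"}]}
```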

For example, if the cache was used, the **Output** section includes entries similar to those in the following screenshot:

:::image type="content" source="media/llm-enable-semantic-caching/cache-lookup.png" alt-text="Screenshot of request trace in the Azure portal.":::

## Related content

* [Caching policies](api-management-policies.md#caching)
* [Azure Cache for Redis](../azure-cache-for-redis/cache-overview.md)