articles/ai-foundry/openai/how-to/spillover-traffic-management.md
25 additions & 4 deletions
@@ -6,15 +6,15 @@ ms.author: mopeakande
 ms.service: azure-ai-foundry
 ms.subservice: azure-ai-foundry-openai
 ms.topic: how-to
-ms.date: 09/03/2025
+ms.date: 10/02/2025
 ---
 
 # Manage traffic with spillover for provisioned deployments
 
 Spillover manages traffic fluctuations on provisioned deployments by routing overage traffic to a corresponding standard deployment. Spillover is an optional capability that can be set for all requests on a given deployment or can be managed on a per-request basis. When spillover is enabled, Azure OpenAI in Azure AI Foundry Models sends any overage traffic from your provisioned deployment to a standard deployment for processing.
 
 > [!NOTE]
-> Spillover is currently not available for the `/v1` [API endpoint](../reference-preview-latest.md) or [responses API](./responses.md).
+> Spillover is currently not available for the [responses API](./responses.md).
 
 ## Prerequisites
 - You need to have a provisioned managed deployment and a standard deployment.
@@ -27,7 +27,28 @@ Spillover manages traffic fluctuations on provisioned deployments by routing ove
 To maximize the utilization of your provisioned deployment, you can enable spillover for all global and data zone provisioned deployments. With spillover, bursts or fluctuations in traffic can be automatically managed by the service. This capability reduces the risk of experiencing disruptions when a provisioned deployment is fully utilized. Alternatively, spillover is configurable per-request to provide flexibility across different scenarios and workloads. Spillover can also now be used for the [Azure AI Foundry Agent Service](../../agents/overview.md).
 
 ## When does spillover come into effect?
-When spillover is enabled for a deployment or configured for a given inference request, spillover is initiated when a non-200 response code is received for a given inference request. When a request results in a non-200 response code, the Azure OpenAI automatically sends the request from your provisioned deployment to your standard deployment to be processed. Even if a subset of requests is routed to the standard deployment, the service prioritizes sending requests to the provisioned deployment before sending any overage requests to the standard deployment, which may incur additional latency.
+When you enable spillover for a deployment or configure it for a given inference request, spillover initiates when a specific non-`200` response code is received for a given inference request as a result of one of these scenarios:
+
+- Provisioned throughput units (PTU) are completely used, resulting in a `429` response code.
+
+- You send a long context token request, resulting in a `400` error code. For example, when using `gpt 4.1` series models, PTU supports only context lengths less than 128k and returns HTTP 400.
+
+- Server errors when processing your request, resulting in error code `500` or `503`.
+
+When a request results in one of these non-`200` response codes, Azure OpenAI automatically sends the request from your provisioned deployment to your standard deployment to be processed.
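For client-side log analysis, the trigger conditions listed in this hunk can be captured in a small lookup. The following is a minimal Python sketch under stated assumptions, not part of the service or of any SDK: `SPILLOVER_TRIGGER_CODES` and `may_trigger_spillover` are hypothetical names, and the check is deliberately coarse because a `400` qualifies only in the long-context case, which the status code alone cannot reveal.

```python
# Hypothetical helper for log analysis; not an Azure OpenAI API.
# Keys are the status codes the added text lists as spillover triggers.
SPILLOVER_TRIGGER_CODES = {
    429: "provisioned throughput units (PTU) completely used",
    400: "long context token request (only this kind of 400 qualifies)",
    500: "server error while processing the request",
    503: "server error while processing the request",
}


def may_trigger_spillover(status_code: int) -> bool:
    """Return True if the status code is one of the listed spillover triggers.

    Assumption: every 400 is treated as a *possible* trigger, because the
    status code by itself does not say whether the request was long-context.
    """
    return status_code in SPILLOVER_TRIGGER_CODES


if __name__ == "__main__":
    for code in (200, 400, 404, 429, 500, 503):
        print(code, may_trigger_spillover(code))
```

A lookup keyed by status code keeps the reason text next to each code, which is convenient when annotating request logs.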
+
+> [!NOTE]
+> Even if a subset of requests is routed to the standard deployment, the service prioritizes sending requests to the provisioned deployment before sending any overage requests to the standard deployment, which might incur additional latency.
+
+## How to know a request spilled over
+
+The following HTTP response headers indicate that a specific request spilled over:
+
+- `x-ms-spillover-from-<deployment-name>`. This header contains the PTU deployment name. The presence of this header indicates that the request was a spillover request.
+
+- `x-ms-<deployment-name>`. This header contains the name of the deployment that served the request. If the request spilled over, the deployment name is the name of the standard deployment.
+
+For a request that spilled over, if the standard deployment request failed for any reason, the original PTU response is used in the response to the customer. The customer sees a header `x-ms-spillover-error` that contains the response code of the spillover request (such as `429` or `500`) so that they know the reason for the failed spillover.
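A client could inspect these headers roughly as follows. This is a hedged Python sketch, not an official client: `read_spillover_headers` and `SpilloverInfo` are hypothetical names, and the code assumes `<deployment-name>` is a placeholder suffix in the header name, so it matches on the `x-ms-spillover-from-` prefix instead of an exact name. It works on any plain mapping of response headers, so no particular HTTP library is assumed.

```python
from typing import Mapping, NamedTuple, Optional

# Header names taken from the added text; the prefix match is an assumption
# about how the "<deployment-name>" placeholder is filled in.
SPILLOVER_FROM_PREFIX = "x-ms-spillover-from-"
SPILLOVER_ERROR_HEADER = "x-ms-spillover-error"


class SpilloverInfo(NamedTuple):
    spillover_from: Optional[str]   # PTU deployment name, if the header is present
    spillover_error: Optional[str]  # response code of a failed spillover, if present


def read_spillover_headers(headers: Mapping[str, str]) -> SpilloverInfo:
    """Extract spillover details from a mapping of HTTP response headers."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower(): value for name, value in headers.items()}
    spillover_from = next(
        (value for name, value in lowered.items()
         if name.startswith(SPILLOVER_FROM_PREFIX)),
        None,
    )
    return SpilloverInfo(spillover_from, lowered.get(SPILLOVER_ERROR_HEADER))


if __name__ == "__main__":
    # "my-ptu" is a made-up deployment name used only for illustration.
    info = read_spillover_headers({"X-MS-Spillover-From-my-ptu": "my-ptu"})
    print(info)
```

Returning both fields separately lets the caller distinguish a request that carries the spillover-from header from one whose spillover attempt failed and reports `x-ms-spillover-error`.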
 
 ## How does spillover affect cost?
 Since spillover uses a combination of provisioned and standard deployments to manage traffic fluctuations, billing for spillover involves two components:
@@ -124,4 +145,4 @@ Applying the `IsSpillover` split lets you view the requests to your deployment t
 ## See also
 
 * [What is provisioned throughput](../concepts/provisioned-throughput.md)
-* [Onboarding to provisioned throughput](./provisioned-throughput-onboarding.md)
+* [Onboarding to provisioned throughput](./provisioned-throughput-onboarding.md)