
fix: AI Rate Limiting Bug Causing Tenant 429 Errors on /v1/ API Path#132

Merged
mishraomp merged 2 commits into main from fix/apim-v1-deployment
Mar 4, 2026

Conversation


@mishraomp mishraomp commented Mar 4, 2026

Fix: AI Rate Limiting Bug Causing Tenant 429 Errors on /v1/ API Path

Summary

Tenants using the standard OpenAI-compatible /v1/chat/completions API format were being silently throttled to 1,000 tokens per minute — a tiny fraction of their allocated quota. This caused a flood of 429 Too Many Requests errors after just one or two API calls, with a mandatory 60-second lockout before they could try again.

The bug was invisible at first glance: individual calls still returned a success response. Only after the second request did the throttle kick in, making it appear as an intermittent or load-related problem rather than a misconfiguration.


Business Impact

| Area | Detail |
| --- | --- |
| Who was affected | Any tenant using the /v1/ API format (standard for most AI tools and SDKs) |
| What they experienced | Requests failing with 429 Too Many Requests after just 1–2 calls, then a mandatory 60-second lockout before they could try again |
| Other API paths | Tenants using the older Azure-native URL format were not affected |

What Was Happening

The AI Hub enforces per-tenant rate limits at the API gateway layer. Limits are defined per AI model — for example, gpt-4.1-mini is allocated 1,500,000 tokens per minute for each tenant.

When a request arrives via the /v1/ path, the gateway inspects the request to identify which model is being called, then applies the correct token budget. Due to a bug, the model name was being constructed as:

$tenant-4.1-mini   ← incorrect (tenant name was prepended)

instead of:

gpt-4.1-mini   ← correct

The gateway couldn't find a matching rate limit rule for that prefixed name, so it silently fell back to a catch-all default of 1,000 tokens per minute — regardless of the tenant's actual entitlement.
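The failure mode can be sketched in a few lines of Python. This is illustrative only — the real lookup is a chain of `<when>` conditions in the APIM policy, and the dictionary below shows just two of the configured models:

```python
# Hypothetical sketch of the gateway's per-model rate-limit lookup.
TOKEN_LIMITS_TPM = {
    "gpt-4.1": 300_000,
    "gpt-4.1-mini": 1_500_000,
}
FALLBACK_TPM = 1_000  # catch-all default when no rule matches

def resolve_tpm(model_name: str) -> int:
    # Unmatched names silently fall through to the catch-all limit.
    return TOKEN_LIMITS_TPM.get(model_name, FALLBACK_TPM)

# Buggy behaviour: tenant prefix was prepended before the lookup.
print(resolve_tpm("ai-hub-admin-gpt-4.1-mini"))  # 1000 — fallback hit
# Fixed behaviour: the bare model name matches its rule.
print(resolve_tpm("gpt-4.1-mini"))               # 1500000
```

Because the fallback path produces a valid limit rather than an error, nothing in the gateway surfaced the mismatch.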

| | Token Budget Applied |
| --- | --- |
| Before fix | 1,000 tokens/min (catch-all fallback — 0.07% of entitlement) |
| After fix | 1,500,000 tokens/min (correct allocation) |

The tenant prefix is still correctly used for routing requests to the right backend — only the rate-limit lookup was affected.


How It Was Diagnosed

  1. Tenant reported repeated 429 errors in the test environment.

  2. APIM gateway logs confirmed the 429s were originating within the API gateway (not from the Azure OpenAI backend itself).

  3. Inspecting response headers revealed the root cause:

    x-ratelimit-limit-tokens:     1,500,000   ← Azure OpenAI backend knows the real limit
    x-ratelimit-remaining-tokens: 986         ← APIM counter started at 1,000, not 1,500,000
    

    After a tiny "Say hello" request consuming 14 tokens: 1,000 − 14 = 986. The math confirmed APIM was enforcing a 1,000 TPM limit, not 1,500,000.

  4. A second request with a longer prompt consumed the remaining 986 tokens and triggered the 429.
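The arithmetic in step 3 generalizes: the rate-limit counter starts at the enforced limit, so `remaining + consumed` reconstructs whatever limit the gateway is actually applying. A minimal check (illustrative function name, not part of the codebase):

```python
def enforced_limit(remaining_tokens: int, tokens_consumed: int) -> int:
    # The counter starts at the enforced limit, so summing the remaining
    # balance and the tokens just consumed recovers that limit.
    return remaining_tokens + tokens_consumed

# Observed before the fix: a 14-token request left 986 remaining.
print(enforced_limit(986, 14))        # 1000 -> APIM was enforcing the fallback
# Expected with the correct allocation applied:
print(enforced_limit(1_499_986, 14))  # 1500000
```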


The Fix

A one-line change in the API gateway policy template:

- if (!string.IsNullOrEmpty(model)) { return "${tenant_name}-" + model; }
+ if (!string.IsNullOrEmpty(model)) { return model; }

The tenant-name prefix is no longer added when resolving the model name for rate-limit matching. It is still correctly applied in the URL rewriting step that routes the request to the backend — that behaviour is unchanged.
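The two distinct uses of the model name can be contrasted in a small sketch (function names are illustrative — the actual logic lives in the APIM policy template):

```python
def rate_limit_key(model: str) -> str:
    # After the fix: the bare model name is used for rate-limit matching.
    return model

def backend_deployment(tenant_name: str, model: str) -> str:
    # URL rewriting still prepends the tenant prefix for backend routing;
    # that behaviour is unchanged by this fix.
    return f"{tenant_name}-{model}"

print(rate_limit_key("gpt-4.1-mini"))                      # gpt-4.1-mini
print(backend_deployment("ai-hub-admin", "gpt-4.1-mini"))  # ai-hub-admin-gpt-4.1-mini
```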

Verified live: After applying the fix, the same "Say hello" request returned:

x-ratelimit-limit-tokens:     1,500,000
x-ratelimit-remaining-tokens: 1,499,986   ✓ correct

Why It Wasn't Caught Sooner

Existing automated integration tests only checked that API calls returned HTTP 200 (success). They never validated the rate-limit headers in the response.

The bug produced a successful first response — just with the wrong token budget silently applied. No test failed. The problem only surfaced when a real tenant made multiple calls in quick succession under production-like load.


Prevention: New Automated Tests

Three new regression tests have been added to the integration test suite. They will run on every deployment going forward:

| Test | What It Checks |
| --- | --- |
| /v1/ format token limit is not the 1,000 TPM fallback | Asserts that the token budget reported for /v1/ calls is greater than 1,000 — the fallback value that indicates broken matching |
| /deployments/ format token limit is not the 1,000 TPM fallback | Same check for the native Azure path |
| /v1/ and /deployments/ report identical token limits for the same model | Both paths must report the same budget — a mismatch immediately flags that one path is hitting the fallback while the other is not |
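The actual tests are bats tests in the integration suite; the invariant they enforce amounts to the following (Python sketch with a hypothetical function name):

```python
FALLBACK_TPM = 1_000  # token budget that indicates broken rate-limit matching

def check_rate_limit_headers(v1_limit: int, deployments_limit: int) -> None:
    # Neither path may report the catch-all fallback budget...
    assert v1_limit > FALLBACK_TPM, "/v1/ path is hitting the fallback"
    assert deployments_limit > FALLBACK_TPM, "/deployments/ path is hitting the fallback"
    # ...and both paths must agree on the budget for the same model.
    assert v1_limit == deployments_limit, "paths report different budgets"

check_rate_limit_headers(1_500_000, 1_500_000)  # passes with the fix applied
print("all header checks passed")
```

Checking the value against the fallback, rather than hard-coding each model's entitlement, keeps the tests valid when quotas change.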

Files Changed

| File | Change |
| --- | --- |
| infra-ai-hub/params/apim/api_policy.xml.tftpl | Remove tenant-name prefix from model name in rate-limit lookup for /v1/ paths |
| tests/integration/test-helper.bash | Add apim_request_with_headers, parse_response_with_headers, and get_response_header helpers |
| tests/integration/v1-chat-completions.bats | Add three rate-limit header regression tests |

AI Hub Infra Changes

Summary: 1 to add, 6 to change, 0 to destroy (across 2 stack(s))

Terraform will perform the following actions:

  # azurerm_api_management_api_policy.tenant["ai-hub-admin"] will be updated in-place
  ~ resource "azurerm_api_management_api_policy" "tenant" {
        id                  = "/subscriptions/****/resourceGroups/ai-services-hub-test/providers/Microsoft.ApiManagement/service/ai-services-hub-test-apim/apis/ai-hub-admin"
      ~ xml_content         = <<-EOT
          - <policies>
          - 	<inbound>
          - 		<base />
          - 		<!-- Extract tracking dimensions from headers -->
          - 		<include-fragment fragment-id="tracking-dimensions" />
          - 		<!-- Tenant identification -->
          - 		<set-header name="X-Tenant-Id" exists-action="override">
          - 			<value>ai-hub-admin</value>
          - 		</set-header>
          - 		<!-- Per-model token rate limiting for OpenAI requests only -->
          - 		<!-- Each model has its own rate limit matching its Azure OpenAI deployment capacity -->
          - 		<!-- Only applies to /openai/* paths; DocInt/Speech/Search/Storage are not rate-limited by token count -->
          - 		<!-- CRITICAL: estimate-prompt-tokens reads the entire request body for tokenization. -->
          - 		<!-- On large binary payloads (e.g., 500KB base64 DocInt images) this causes APIM to hang. -->
          - 		<!-- Extracts deployment name from URL: /openai/deployments/{deployment-name}/... -->
          - 		<!-- For /v1/ format: extracts from request body "model" field (deployment name lookup key) -->
          - 		<choose>
          - 			<when condition="@(context.Request.Url.Path.ToLower().Contains(&quot;openai&quot;))">
          - 				<set-variable name="deploymentName" value="@{
          + <policies>
          +     <inbound>
          +         <base />
          +         <!-- Extract tracking dimensions from headers -->
          +         <include-fragment fragment-id="tracking-dimensions" />
          +         <!-- Tenant identification -->
          +         <set-header name="X-Tenant-Id" exists-action="override">
          +             <value>ai-hub-admin</value>
          +         </set-header>
          +         <!-- Per-model token rate limiting for OpenAI requests only -->
          +         <!-- Each model has its own rate limit matching its Azure OpenAI deployment capacity -->
          +         <!-- Only applies to /openai/* paths; DocInt/Speech/Search/Storage are not rate-limited by token count -->
          +         <!-- CRITICAL: estimate-prompt-tokens reads the entire request body for tokenization. -->
          +         <!-- On large binary payloads (e.g., 500KB base64 DocInt images) this causes APIM to hang. -->
          +         <!-- Extracts deployment name from URL: /openai/deployments/{deployment-name}/... -->
          +         <!-- For /v1/ format: extracts from request body "model" field (deployment name lookup key) -->
          +         <choose>
          +             <when condition="@(context.Request.Url.Path.ToLower().Contains(&quot;openai&quot;))">
          +                 <set-variable name="deploymentName" value="@{
                                var path = context.Request.Url.Path;
                                var match = System.Text.RegularExpressions.Regex.Match(path, @&quot;/deployments/([^/]+)/&quot;);
                                if (match.Success) { return match.Groups[1].Value; }
                                // For /v1/ format: model field is the deployment name lookup key on Azure OpenAI
          -                     // Client sends e.g. "gpt-4.1-mini"; tenant-prefix to match deployment name
          +                     // Client sends e.g. "gpt-4.1-mini"; use bare model name to match deployment names
          +                     // NOTE: do NOT prepend tenant prefix here — rate-limit <when> conditions compare
          +                     // against bare deployment names from tfvars (e.g. "gpt-4.1-mini", not "tenant-gpt-4.1-mini").
          +                     // URL rewriting (further below) adds the tenant prefix independently for backend routing.
                                if (path.ToLower().Contains(&quot;/v1/&quot;)) {
                                    try {
                                        var body = context.Request.Body.As&lt;JObject&gt;(preserveContent: true);
                                        var model = body?[&quot;model&quot;]?.ToString();
          -                             if (!string.IsNullOrEmpty(model)) { return &quot;ai-hub-admin-&quot; + model; }
          +                             if (!string.IsNullOrEmpty(model)) { return model; }
                                    } catch { }
                                }
                                return &quot;unknown&quot;;
          -                 }" />
          - 				<choose>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4.1&quot;)">
          - 						<!-- Rate limit for gpt-4.1: 300k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4.1&quot;)" tokens-per-minute="300000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4.1-mini&quot;)">
          - 						<!-- Rate limit for gpt-4.1-mini: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4.1-mini&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4.1-nano&quot;)">
          - 						<!-- Rate limit for gpt-4.1-nano: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4.1-nano&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4o&quot;)">
          - 						<!-- Rate limit for gpt-4o: 300k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4o&quot;)" tokens-per-minute="300000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4o-mini&quot;)">
          - 						<!-- Rate limit for gpt-4o-mini: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4o-mini&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5-mini&quot;)">
          - 						<!-- Rate limit for gpt-5-mini: 100k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5-mini&quot;)" tokens-per-minute="100000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5-nano&quot;)">
          - 						<!-- Rate limit for gpt-5-nano: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5-nano&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5.1-chat&quot;)">
          - 						<!-- Rate limit for gpt-5.1-chat: 50k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5.1-chat&quot;)" tokens-per-minute="50000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5.1-codex-mini&quot;)">
          - 						<!-- Rate limit for gpt-5.1-codex-mini: 100k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5.1-codex-mini&quot;)" tokens-per-minute="100000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;o1&quot;)">
          - 						<!-- Rate limit for o1: 50k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-o1&quot;)" tokens-per-minute="50000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;o3-mini&quot;)">
          - 						<!-- Rate limit for o3-mini: 50k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-o3-mini&quot;)" tokens-per-minute="50000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;o4-mini&quot;)">
          - 						<!-- Rate limit for o4-mini: 100k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-o4-mini&quot;)" tokens-per-minute="100000" estimate-prompt-tokens
(truncated, see workflow logs for complete plan)

Updated by CI — plan against test environment (run #239) at 2026-03-04 05:51:04 UTC.

@mishraomp mishraomp changed the title fix: apim v1 deployment fix: AI Rate Limiting Bug Causing Tenant 429 Errors on /v1/ API Path Mar 4, 2026
@mishraomp mishraomp merged commit 30655ed into main Mar 4, 2026
19 checks passed
@mishraomp mishraomp deleted the fix/apim-v1-deployment branch March 4, 2026 06:01
@mishraomp mishraomp self-assigned this Mar 4, 2026
@mishraomp mishraomp added the bug Something isn't working label Mar 4, 2026
@github-project-automation github-project-automation bot moved this from Backlog to Done in CSS AI Hub Tracking Mar 4, 2026
