fix: AI Rate Limiting Bug Causing Tenant 429 Errors on /v1/ API Path#132
Merged
Summary
Tenants using the standard OpenAI-compatible /v1/chat/completions API format were being silently throttled to 1,000 tokens per minute — a tiny fraction of their allocated quota. This caused a flood of 429 Too Many Requests errors after just one or two API calls, with a mandatory 60-second lockout before they could try again.

The bug was invisible at first glance: individual calls still returned a success response. Only after the second request did the throttle kick in, making it appear to be an intermittent or load-related problem rather than a misconfiguration.
Business Impact
Tenants using the /v1/ API format (standard for most AI tools and SDKs) received 429 Too Many Requests after just 1–2 calls, with a 60-second lockout each minute.

What Was Happening
The AI Hub enforces per-tenant rate limits at the API gateway layer. Limits are defined per AI model — for example, gpt-4.1-mini is allocated 1,500,000 tokens per minute for each tenant.

When a request arrives via the /v1/ path, the gateway inspects the request to identify which model is being called, then applies the correct token budget. Due to a bug, the model name used for the rate-limit lookup was being constructed with the tenant-name prefix prepended, rather than as the bare model name.
The gateway couldn't find a matching rate limit rule for that prefixed name, so it silently fell back to a catch-all default of 1,000 tokens per minute — regardless of the tenant's actual entitlement.
The tenant prefix is still correctly used for routing requests to the right backend — only the rate-limit lookup was affected.
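The failure mode can be illustrated with a small bash sketch. This is purely illustrative, not the real APIM policy (which is an XML template); the rule table, tenant prefix, and function name here are invented for the example:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the rate-limit lookup, NOT the real APIM policy.
# Rules are keyed by the bare model name; any unmatched key silently
# falls through to the 1,000 TPM catch-all default.
declare -A TPM_RULES=( ["gpt-4.1-mini"]=1500000 )
DEFAULT_TPM=1000

resolve_tpm() {
  local model="$1"
  echo "${TPM_RULES[$model]:-$DEFAULT_TPM}"
}

resolve_tpm "gpt-4.1-mini"          # bare name matches the rule: 1500000
resolve_tpm "tenant1-gpt-4.1-mini"  # prefixed name (the bug): falls back to 1000
```

With the fix, the bare model name is used for this lookup, so the tenant's real per-model rule matches again.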
How It Was Diagnosed
Tenant reported repeated 429 errors in the test environment.
APIM gateway logs confirmed the 429s were originating within the API gateway (not from the Azure OpenAI backend itself).
Inspecting response headers revealed the root cause:
After a tiny "Say hello" request consuming 14 tokens, the remaining-tokens header showed 986 (1,000 − 14 = 986). The math confirmed APIM was enforcing a 1,000 TPM limit, not 1,500,000. A second request with a longer prompt consumed the remaining 986 tokens and triggered the 429.
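The diagnostic arithmetic is easy to reproduce. The header name mentioned in the comments is an assumption for illustration, since which response header carries the remaining-token count is configurable per deployment:

```shell
# Reproducing the diagnostic arithmetic. The enforced budget minus the
# tokens consumed by the first request must equal the value observed in
# the remaining-tokens response header (header name is deployment-specific).
enforced_tpm=1000   # the fallback limit actually being applied
consumed=14         # tokens used by the "Say hello" request
remaining=$(( enforced_tpm - consumed ))
echo "$remaining"   # prints 986, matching the observed header
```

Had the correct 1,500,000 TPM rule matched, the same request would have left 1,499,986 tokens remaining, so the header value alone pinpointed which limit was in force.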
The Fix
A one-line change in the API gateway policy template:
The tenant-name prefix is no longer added when resolving the model name for rate-limit matching. It is still correctly applied in the URL rewriting step that routes the request to the backend — that behaviour is unchanged.
Verified live: after applying the fix, the same "Say hello" request came back with rate-limit headers reflecting the tenant's full entitlement rather than the 1,000 TPM fallback.
Why It Wasn't Caught Sooner
Existing automated integration tests only checked that API calls returned HTTP 200 (success). They never validated the rate-limit headers in the response.
The bug produced a successful first response — just with the wrong token budget silently applied. No test failed. The problem only surfaced when a real tenant made multiple calls in quick succession under production-like load.
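The missing check can be sketched in bash: a test that passes on HTTP 200 alone cannot catch this bug, but one that asserts on the advertised token limit can. The function name and fixed values below are assumptions for illustration (the real suite uses the bats helpers listed under Files Changed):

```shell
#!/usr/bin/env bash
# Sketch of the check the old tests lacked: assert on the token limit the
# gateway advertises, not just on the HTTP status code.
FALLBACK_TPM=1000   # the catch-all default that signals broken rule matching

assert_not_fallback_limit() {
  local limit="$1"
  if (( limit <= FALLBACK_TPM )); then
    echo "FAIL: gateway is enforcing the ${FALLBACK_TPM} TPM fallback (limit=${limit})"
    return 1
  fi
  echo "OK: limit=${limit}"
}

# In the real tests the limit would be read from a rate-limit response
# header via a helper; fixed values stand in for a live request here.
assert_not_fallback_limit 1500000        # healthy: prints OK
assert_not_fallback_limit 1000 || true   # buggy state: prints FAIL
```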
Prevention: New Automated Tests
Three new regression tests have been added to the integration test suite. They will run on every deployment going forward:
1. /v1/ format token limit is not the 1,000 TPM fallback — asserts that the token limit reported on /v1/ calls is greater than 1,000, the fallback value that indicates broken matching.
2. /deployments/ format token limit is not the 1,000 TPM fallback.
3. /v1/ and /deployments/ report identical token limits for the same model.

Files Changed
- infra-ai-hub/params/apim/api_policy.xml.tftpl — fix to the policy template for /v1/ paths
- tests/integration/test-helper.bash — added apim_request_with_headers, parse_response_with_headers, and get_response_header helpers
- tests/integration/v1-chat-completions.bats — new regression tests

AI Hub Infra Changes
Summary: 1 to add, 6 to change, 0 to destroy (across 2 stack(s))
Updated by CI — plan against test environment (run #239) at 2026-03-04 05:51:04 UTC.