
fix: AI Rate Limiting Bug Causing Tenant 429 Errors on /v1/ API Path#132

Merged
mishraomp merged 2 commits into main from fix/apim-v1-deployment
Mar 4, 2026

Conversation


@mishraomp mishraomp commented Mar 4, 2026

Fix: AI Rate Limiting Bug Causing Tenant 429 Errors on /v1/ API Path

Summary

Tenants using the standard OpenAI-compatible /v1/chat/completions API format were being silently throttled to 1,000 tokens per minute — a tiny fraction of their allocated quota. This caused a flood of 429 Too Many Requests errors after just one or two API calls, with a mandatory 60-second lockout before they could try again.

The bug was invisible at first glance: individual calls still returned a success response. Only after the second request did the throttle kick in, making it appear as an intermittent or load-related problem rather than a misconfiguration.


Business Impact

| Area | Detail |
| --- | --- |
| Who was affected | Any tenant using the /v1/ API format (standard for most AI tools and SDKs) |
| What they experienced | Requests failing with 429 Too Many Requests after just 1–2 calls, then a mandatory 60-second lockout before they could try again |
| Other API paths | Tenants using the older Azure-native URL format were not affected |

What Was Happening

The AI Hub enforces per-tenant rate limits at the API gateway layer. Limits are defined per AI model — for example, gpt-4.1-mini is allocated 1,500,000 tokens per minute for each tenant.

When a request arrives via the /v1/ path, the gateway inspects the request to identify which model is being called, then applies the correct token budget. Due to a bug, the model name was being constructed as:

$tenant-4.1-mini   ← incorrect (tenant name was prepended)

instead of:

gpt-4.1-mini   ← correct

The gateway couldn't find a matching rate limit rule for that prefixed name, so it silently fell back to a catch-all default of 1,000 tokens per minute — regardless of the tenant's actual entitlement.
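The failure mode can be sketched in a few lines of Python. This is illustrative only — the real lookup is a chain of `<when>` conditions in the APIM policy, and the dictionary below shows just two of the configured models:

```python
# Hypothetical sketch of the gateway's per-model rate-limit lookup.
TOKEN_LIMITS_TPM = {
    "gpt-4.1": 300_000,
    "gpt-4.1-mini": 1_500_000,
}
FALLBACK_TPM = 1_000  # catch-all default when no rule matches

def resolve_tpm(model_name: str) -> int:
    # Unmatched names silently fall through to the catch-all limit.
    return TOKEN_LIMITS_TPM.get(model_name, FALLBACK_TPM)

# Buggy behaviour: tenant prefix was prepended before the lookup.
print(resolve_tpm("ai-hub-admin-gpt-4.1-mini"))  # 1000 — fallback hit
# Fixed behaviour: the bare model name matches its rule.
print(resolve_tpm("gpt-4.1-mini"))               # 1500000
```

Because the fallback path produces a valid limit rather than an error, nothing in the gateway surfaced the mismatch.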

| | Token Budget Applied |
| --- | --- |
| Before fix | 1,000 tokens/min (catch-all fallback — 0.07% of entitlement) |
| After fix | 1,500,000 tokens/min (correct allocation) |

The tenant prefix is still correctly used for routing requests to the right backend — only the rate-limit lookup was affected.


How It Was Diagnosed

  1. Tenant reported repeated 429 errors in the test environment.

  2. APIM gateway logs confirmed the 429s were originating within the API gateway (not from the Azure OpenAI backend itself).

  3. Inspecting response headers revealed the root cause:

    x-ratelimit-limit-tokens:     1,500,000   ← Azure OpenAI backend knows the real limit
    x-ratelimit-remaining-tokens: 986         ← APIM counter started at 1,000, not 1,500,000
    

    After a tiny "Say hello" request consuming 14 tokens: 1,000 − 14 = 986. The math confirmed APIM was enforcing a 1,000 TPM limit, not 1,500,000.

  4. A second request with a longer prompt consumed the remaining 986 tokens and triggered the 429.
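The arithmetic in step 3 generalizes: the rate-limit counter starts at the enforced limit, so `remaining + consumed` reconstructs whatever limit the gateway is actually applying. A minimal check (illustrative function name, not part of the codebase):

```python
def enforced_limit(remaining_tokens: int, tokens_consumed: int) -> int:
    # The counter starts at the enforced limit, so summing the remaining
    # balance and the tokens just consumed recovers that limit.
    return remaining_tokens + tokens_consumed

# Observed before the fix: a 14-token request left 986 remaining.
print(enforced_limit(986, 14))        # 1000 -> APIM was enforcing the fallback
# Expected with the correct allocation applied:
print(enforced_limit(1_499_986, 14))  # 1500000
```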


The Fix

A one-line change in the API gateway policy template:

- if (!string.IsNullOrEmpty(model)) { return "${tenant_name}-" + model; }
+ if (!string.IsNullOrEmpty(model)) { return model; }

The tenant-name prefix is no longer added when resolving the model name for rate-limit matching. It is still correctly applied in the URL rewriting step that routes the request to the backend — that behaviour is unchanged.
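The two distinct uses of the model name can be contrasted in a small sketch (function names are illustrative — the actual logic lives in the APIM policy template):

```python
def rate_limit_key(model: str) -> str:
    # After the fix: the bare model name is used for rate-limit matching.
    return model

def backend_deployment(tenant_name: str, model: str) -> str:
    # URL rewriting still prepends the tenant prefix for backend routing;
    # that behaviour is unchanged by this fix.
    return f"{tenant_name}-{model}"

print(rate_limit_key("gpt-4.1-mini"))                      # gpt-4.1-mini
print(backend_deployment("ai-hub-admin", "gpt-4.1-mini"))  # ai-hub-admin-gpt-4.1-mini
```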

Verified live: After applying the fix, the same "Say hello" request returned:

x-ratelimit-limit-tokens:     1,500,000
x-ratelimit-remaining-tokens: 1,499,986   ✓ correct

Why It Wasn't Caught Sooner

Existing automated integration tests only checked that API calls returned HTTP 200 (success). They never validated the rate-limit headers in the response.

The bug produced a successful first response — just with the wrong token budget silently applied. No test failed. The problem only surfaced when a real tenant made multiple calls in quick succession under production-like load.


Prevention: New Automated Tests

Three new regression tests have been added to the integration test suite. They will run on every deployment going forward:

| Test | What It Checks |
| --- | --- |
| /v1/ format token limit is not the 1,000 TPM fallback | Asserts that the token budget reported for /v1/ calls is greater than 1,000 — the fallback value that indicates broken matching |
| /deployments/ format token limit is not the 1,000 TPM fallback | Same check for the native Azure path |
| /v1/ and /deployments/ report identical token limits for the same model | Both paths must report the same budget — a mismatch immediately flags that one path is hitting the fallback while the other is not |
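The actual tests are bats tests in the integration suite; the invariant they enforce amounts to the following (Python sketch with a hypothetical function name):

```python
FALLBACK_TPM = 1_000  # token budget that indicates broken rate-limit matching

def check_rate_limit_headers(v1_limit: int, deployments_limit: int) -> None:
    # Neither path may report the catch-all fallback budget...
    assert v1_limit > FALLBACK_TPM, "/v1/ path is hitting the fallback"
    assert deployments_limit > FALLBACK_TPM, "/deployments/ path is hitting the fallback"
    # ...and both paths must agree on the budget for the same model.
    assert v1_limit == deployments_limit, "paths report different budgets"

check_rate_limit_headers(1_500_000, 1_500_000)  # passes with the fix applied
print("all header checks passed")
```

Checking the value against the fallback, rather than hard-coding each model's entitlement, keeps the tests valid when quotas change.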

Files Changed

| File | Change |
| --- | --- |
| infra-ai-hub/params/apim/api_policy.xml.tftpl | Remove tenant-name prefix from model name in rate-limit lookup for /v1/ paths |
| tests/integration/test-helper.bash | Add apim_request_with_headers, parse_response_with_headers, and get_response_header helpers |
| tests/integration/v1-chat-completions.bats | Add three rate-limit header regression tests |

AI Hub Infra Changes

Summary: 1 to add, 6 to change, 0 to destroy (across 2 stack(s))

Terraform will perform the following actions:

  # azurerm_api_management_api_policy.tenant["ai-hub-admin"] will be updated in-place
  ~ resource "azurerm_api_management_api_policy" "tenant" {
        id                  = "/subscriptions/****/resourceGroups/ai-services-hub-test/providers/Microsoft.ApiManagement/service/ai-services-hub-test-apim/apis/ai-hub-admin"
      ~ xml_content         = <<-EOT
          - <policies>
          - 	<inbound>
          - 		<base />
          - 		<!-- Extract tracking dimensions from headers -->
          - 		<include-fragment fragment-id="tracking-dimensions" />
          - 		<!-- Tenant identification -->
          - 		<set-header name="X-Tenant-Id" exists-action="override">
          - 			<value>ai-hub-admin</value>
          - 		</set-header>
          - 		<!-- Per-model token rate limiting for OpenAI requests only -->
          - 		<!-- Each model has its own rate limit matching its Azure OpenAI deployment capacity -->
          - 		<!-- Only applies to /openai/* paths; DocInt/Speech/Search/Storage are not rate-limited by token count -->
          - 		<!-- CRITICAL: estimate-prompt-tokens reads the entire request body for tokenization. -->
          - 		<!-- On large binary payloads (e.g., 500KB base64 DocInt images) this causes APIM to hang. -->
          - 		<!-- Extracts deployment name from URL: /openai/deployments/{deployment-name}/... -->
          - 		<!-- For /v1/ format: extracts from request body "model" field (deployment name lookup key) -->
          - 		<choose>
          - 			<when condition="@(context.Request.Url.Path.ToLower().Contains(&quot;openai&quot;))">
          - 				<set-variable name="deploymentName" value="@{
          + <policies>
          +     <inbound>
          +         <base />
          +         <!-- Extract tracking dimensions from headers -->
          +         <include-fragment fragment-id="tracking-dimensions" />
          +         <!-- Tenant identification -->
          +         <set-header name="X-Tenant-Id" exists-action="override">
          +             <value>ai-hub-admin</value>
          +         </set-header>
          +         <!-- Per-model token rate limiting for OpenAI requests only -->
          +         <!-- Each model has its own rate limit matching its Azure OpenAI deployment capacity -->
          +         <!-- Only applies to /openai/* paths; DocInt/Speech/Search/Storage are not rate-limited by token count -->
          +         <!-- CRITICAL: estimate-prompt-tokens reads the entire request body for tokenization. -->
          +         <!-- On large binary payloads (e.g., 500KB base64 DocInt images) this causes APIM to hang. -->
          +         <!-- Extracts deployment name from URL: /openai/deployments/{deployment-name}/... -->
          +         <!-- For /v1/ format: extracts from request body "model" field (deployment name lookup key) -->
          +         <choose>
          +             <when condition="@(context.Request.Url.Path.ToLower().Contains(&quot;openai&quot;))">
          +                 <set-variable name="deploymentName" value="@{
                                var path = context.Request.Url.Path;
                                var match = System.Text.RegularExpressions.Regex.Match(path, @&quot;/deployments/([^/]+)/&quot;);
                                if (match.Success) { return match.Groups[1].Value; }
                                // For /v1/ format: model field is the deployment name lookup key on Azure OpenAI
          -                     // Client sends e.g. "gpt-4.1-mini"; tenant-prefix to match deployment name
          +                     // Client sends e.g. "gpt-4.1-mini"; use bare model name to match deployment names
          +                     // NOTE: do NOT prepend tenant prefix here — rate-limit <when> conditions compare
          +                     // against bare deployment names from tfvars (e.g. "gpt-4.1-mini", not "tenant-gpt-4.1-mini").
          +                     // URL rewriting (further below) adds the tenant prefix independently for backend routing.
                                if (path.ToLower().Contains(&quot;/v1/&quot;)) {
                                    try {
                                        var body = context.Request.Body.As&lt;JObject&gt;(preserveContent: true);
                                        var model = body?[&quot;model&quot;]?.ToString();
          -                             if (!string.IsNullOrEmpty(model)) { return &quot;ai-hub-admin-&quot; + model; }
          +                             if (!string.IsNullOrEmpty(model)) { return model; }
                                    } catch { }
                                }
                                return &quot;unknown&quot;;
          -                 }" />
          - 				<choose>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4.1&quot;)">
          - 						<!-- Rate limit for gpt-4.1: 300k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4.1&quot;)" tokens-per-minute="300000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4.1-mini&quot;)">
          - 						<!-- Rate limit for gpt-4.1-mini: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4.1-mini&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4.1-nano&quot;)">
          - 						<!-- Rate limit for gpt-4.1-nano: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4.1-nano&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4o&quot;)">
          - 						<!-- Rate limit for gpt-4o: 300k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4o&quot;)" tokens-per-minute="300000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4o-mini&quot;)">
          - 						<!-- Rate limit for gpt-4o-mini: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4o-mini&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5-mini&quot;)">
          - 						<!-- Rate limit for gpt-5-mini: 100k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5-mini&quot;)" tokens-per-minute="100000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5-nano&quot;)">
          - 						<!-- Rate limit for gpt-5-nano: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5-nano&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5.1-chat&quot;)">
          - 						<!-- Rate limit for gpt-5.1-chat: 50k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5.1-chat&quot;)" tokens-per-minute="50000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5.1-codex-mini&quot;)">
          - 						<!-- Rate limit for gpt-5.1-codex-mini: 100k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5.1-codex-mini&quot;)" tokens-per-minute="100000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;o1&quot;)">
          - 						<!-- Rate limit for o1: 50k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-o1&quot;)" tokens-per-minute="50000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;o3-mini&quot;)">
          - 						<!-- Rate limit for o3-mini: 50k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-o3-mini&quot;)" tokens-per-minute="50000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;o4-mini&quot;)">
          - 						<!-- Rate limit for o4-mini: 100k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-o4-mini&quot;)" tokens-per-minute="100000" estimate-prompt-tokens
(truncated, see workflow logs for complete plan)

Updated by CI — plan against test environment (run #239) at 2026-03-04 05:51:04 UTC.

@mishraomp mishraomp changed the title fix: apim v1 deployment fix: AI Rate Limiting Bug Causing Tenant 429 Errors on /v1/ API Path Mar 4, 2026
@mishraomp mishraomp merged commit 30655ed into main Mar 4, 2026
19 checks passed
@mishraomp mishraomp deleted the fix/apim-v1-deployment branch March 4, 2026 06:01
@mishraomp mishraomp self-assigned this Mar 4, 2026
@mishraomp mishraomp added the bug Something isn't working label Mar 4, 2026
@github-project-automation github-project-automation bot moved this from Backlog to Done in CSS AI Hub Tracking Mar 4, 2026
