
chore: update tenant-info to show live endpoints#120

Merged
mishraomp merged 1 commit into main from feat/tenant-info-endpoints
Feb 28, 2026
Conversation


@mishraomp mishraomp commented Feb 28, 2026

AI Hub Infra Changes

Summary: 0 to add, 4 to change, 0 to destroy (across 1 stack(s))

Show plan details
Terraform will perform the following actions:

  # azurerm_api_management_api_policy.tenant["gcpe-media-monitoring"] will be updated in-place
  ~ resource "azurerm_api_management_api_policy" "tenant" {
        id                  = "/subscriptions/****/resourceGroups/ai-services-hub-test/providers/Microsoft.ApiManagement/service/ai-services-hub-test-apim/apis/gcpe-media-monitoring"
      ~ xml_content         = <<-EOT
          - <policies>
          - 	<inbound>
          - 		<base />
          - 		<!-- Extract tracking dimensions from headers -->
          - 		<include-fragment fragment-id="tracking-dimensions" />
          - 		<!-- Tenant identification -->
          - 		<set-header name="X-Tenant-Id" exists-action="override">
          - 			<value>gcpe-media-monitoring</value>
          - 		</set-header>
          - 		<!-- Per-model token rate limiting for OpenAI requests only -->
          - 		<!-- Each model has its own rate limit matching its Azure OpenAI deployment capacity -->
          - 		<!-- Only applies to /openai/* paths; DocInt/Speech/Search/Storage are not rate-limited by token count -->
          - 		<!-- CRITICAL: estimate-prompt-tokens reads the entire request body for tokenization. -->
          - 		<!-- On large binary payloads (e.g., 500KB base64 DocInt images) this causes APIM to hang. -->
          - 		<!-- Extracts deployment name from URL: /openai/deployments/{deployment-name}/... -->
          - 		<!-- For /v1/ format: extracts from request body "model" field (deployment name lookup key) -->
          - 		<choose>
          - 			<when condition="@(context.Request.Url.Path.ToLower().Contains(&quot;openai&quot;))">
          - 				<set-variable name="deploymentName" value="@{
          + <policies>
          +     <inbound>
          +         <base />
          +         <!-- Extract tracking dimensions from headers -->
          +         <include-fragment fragment-id="tracking-dimensions" />
          +         <!-- Tenant identification -->
          +         <set-header name="X-Tenant-Id" exists-action="override">
          +             <value>gcpe-media-monitoring</value>
          +         </set-header>
          +         <!-- Per-model token rate limiting for OpenAI requests only -->
          +         <!-- Each model has its own rate limit matching its Azure OpenAI deployment capacity -->
          +         <!-- Only applies to /openai/* paths; DocInt/Speech/Search/Storage are not rate-limited by token count -->
          +         <!-- CRITICAL: estimate-prompt-tokens reads the entire request body for tokenization. -->
          +         <!-- On large binary payloads (e.g., 500KB base64 DocInt images) this causes APIM to hang. -->
          +         <!-- Extracts deployment name from URL: /openai/deployments/{deployment-name}/... -->
          +         <!-- For /v1/ format: extracts from request body "model" field (deployment name lookup key) -->
          +         <choose>
          +             <when condition="@(context.Request.Url.Path.ToLower().Contains(&quot;openai&quot;))">
          +                 <set-variable name="deploymentName" value="@{
                                var path = context.Request.Url.Path;
                                var match = System.Text.RegularExpressions.Regex.Match(path, @&quot;/deployments/([^/]+)/&quot;);
                                if (match.Success) { return match.Groups[1].Value; }
                                // For /v1/ format: model field is the deployment name lookup key on Azure OpenAI
                                // Client sends e.g. "gpt-4.1-mini"; tenant-prefix to match deployment name
                                if (path.ToLower().Contains(&quot;/v1/&quot;)) {
                                    try {
                                        var body = context.Request.Body.As&lt;JObject&gt;(preserveContent: true);
                                        var model = body?[&quot;model&quot;]?.ToString();
                                        if (!string.IsNullOrEmpty(model)) { return &quot;gcpe-media-monitoring-&quot; + model; }
                                    } catch { }
                                }
                                return &quot;unknown&quot;;
          -                 }" />
          - 				<choose>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4.1&quot;)">
          - 						<!-- Rate limit for gpt-4.1: 300k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4.1&quot;)" tokens-per-minute="300000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4.1-mini&quot;)">
          - 						<!-- Rate limit for gpt-4.1-mini: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4.1-mini&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4.1-nano&quot;)">
          - 						<!-- Rate limit for gpt-4.1-nano: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4.1-nano&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4o&quot;)">
          - 						<!-- Rate limit for gpt-4o: 300k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4o&quot;)" tokens-per-minute="300000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-4o-mini&quot;)">
          - 						<!-- Rate limit for gpt-4o-mini: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-4o-mini&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5-mini&quot;)">
          - 						<!-- Rate limit for gpt-5-mini: 100k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5-mini&quot;)" tokens-per-minute="100000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5-nano&quot;)">
          - 						<!-- Rate limit for gpt-5-nano: 1500k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5-nano&quot;)" tokens-per-minute="1500000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5.1-chat&quot;)">
          - 						<!-- Rate limit for gpt-5.1-chat: 50k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5.1-chat&quot;)" tokens-per-minute="50000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;gpt-5.1-codex-mini&quot;)">
          - 						<!-- Rate limit for gpt-5.1-codex-mini: 100k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-gpt-5.1-codex-mini&quot;)" tokens-per-minute="100000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;o1&quot;)">
          - 						<!-- Rate limit for o1: 50k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-o1&quot;)" tokens-per-minute="50000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;o3-mini&quot;)">
          - 						<!-- Rate limit for o3-mini: 50k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-o3-mini&quot;)" tokens-per-minute="50000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;o4-mini&quot;)">
          - 						<!-- Rate limit for o4-mini: 100k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counter-key="@(context.Subscription.Id + &quot;-o4-mini&quot;)" tokens-per-minute="100000" estimate-prompt-tokens="true" remaining-tokens-variable-name="remainingTokens" remaining-tokens-header-name="x-ratelimit-remaining-tokens" tokens-consumed-variable-name="tokensConsumed" />
          - 					</when>
          - 					<when condition="@(context.Variables.GetValueOrDefault&lt;string&gt;(&quot;deploymentName&quot;, &quot;&quot;) == &quot;text-embedding-ada-002&quot;)">
          - 						<!-- Rate limit for text-embedding-ada-002: 100k TPM (deployment capacity is in thousands of TPM) -->
          - 						<llm-token-limit counte
(truncated, see workflow logs for complete plan)
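The deployment-name extraction in the policy above (URL regex first, then the `model` field of the request body for `/v1/` requests, tenant-prefixed to match the deployment name) can be sketched as follows. This is a minimal Python sketch of the same logic, not the policy expression itself; the function name is hypothetical.

```python
import json
import re

TENANT_PREFIX = "gcpe-media-monitoring-"  # tenant prefix used by this policy


def extract_deployment_name(path: str, body: bytes) -> str:
    """Mirror the policy's deployment-name lookup.

    1. Prefer the /deployments/{deployment-name}/ segment of the URL path.
    2. For /v1/ requests, fall back to the JSON body's "model" field,
       prefixed with the tenant name to match the deployment name.
    3. Otherwise return "unknown".
    """
    match = re.search(r"/deployments/([^/]+)/", path)
    if match:
        return match.group(1)
    if "/v1/" in path.lower():
        try:
            model = json.loads(body).get("model")
            if model:
                return TENANT_PREFIX + model
        except (ValueError, AttributeError):
            pass  # unparseable or non-object body: fall through to "unknown"
    return "unknown"
```

As in the policy, an unmatched request yields `"unknown"` and falls outside every model-specific rate-limit branch.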

Updated by CI — plan against test environment (run #216).
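The per-model limits in the plan key a tokens-per-minute counter on subscription ID plus model name (the policy's `counter-key`). A rough fixed-window sketch of that behavior, in Python with a hypothetical class name and a hypothetical subset of the TPM budgets shown above:

```python
import time
from collections import defaultdict

# Hypothetical subset of the per-model TPM budgets from the policy.
TPM_LIMITS = {"gpt-4.1": 300_000, "gpt-4.1-mini": 1_500_000, "gpt-5.1-chat": 50_000}


class TokenLimiter:
    """Fixed-window tokens-per-minute counter keyed by subscription + model,
    analogous to the policy's llm-token-limit counter-key."""

    def __init__(self, limits):
        self.limits = limits
        # key -> [window_start_seconds, tokens_used_in_window]
        self.windows = defaultdict(lambda: [0.0, 0])

    def try_consume(self, subscription_id: str, model: str, tokens: int, now=None) -> bool:
        limit = self.limits.get(model)
        if limit is None:
            return True  # models without a branch are not token-limited
        now = time.time() if now is None else now
        window = self.windows[f"{subscription_id}-{model}"]
        if now - window[0] >= 60:
            window[0], window[1] = now, 0  # start a fresh one-minute window
        if window[1] + tokens > limit:
            return False  # budget exceeded: APIM would answer 429 here
        window[1] += tokens
        return True
```

Because the counter key includes the subscription ID, each tenant subscription gets its own per-model budget; APIM's actual accounting is distributed across gateway instances, which this local sketch does not model.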

@mishraomp mishraomp self-assigned this Feb 28, 2026
@mishraomp mishraomp added documentation Improvements or additions to documentation enhancement New feature or request Task Terraform devops labels Feb 28, 2026
@mishraomp mishraomp merged commit ac5bbec into main Feb 28, 2026
11 checks passed
@mishraomp mishraomp deleted the feat/tenant-info-endpoints branch February 28, 2026 02:17
