- Provide cost management per subscription

To effectively manage the costs associated with Azure OpenAI Service usage, you can implement policies in Azure API Management (APIM) that control and monitor the number of tokens consumed by each request. By limiting tokens per subscription and emitting detailed metrics, you can enforce usage quotas, prevent overuse, and enable charge-back models for cost recovery.

### Implementing Token Limits and Metrics Emission

The following APIM policy helps you manage costs by:

- Limiting the number of tokens a subscription can consume per minute.
- Estimating prompt tokens, so that both prompt and completion tokens count toward the limit.
- Emitting token usage metrics with dimensions that help you analyze and report on token consumption per deployment and subscription.

#### APIM Policy Configuration

```xml
<policies>
    <inbound>
        <!-- Set the backend service to your Azure OpenAI endpoint -->
        <set-backend-service id="apim-generated-policy" backend-id="azure-openai-openai-endpoint" />

        <!-- Extract the deployment ID from the URL path after '/deployments/' -->
        <set-variable name="deploymentId" value="@(context.Request.Url.Path.Split('/').ElementAtOrDefault(3))" />

        <!-- Limit tokens per minute per subscription -->
        <azure-openai-token-limit
            tokens-per-minute="10000000"
            counter-key="@(context.Subscription.Id)"
            estimate-prompt-tokens="true"
            tokens-consumed-header-name="consumed-tokens"
            remaining-tokens-header-name="remaining-tokens" />

        <!-- Emit token metrics with custom dimensions -->
        <azure-openai-emit-token-metric>
            <dimension name="API ID" />
            <dimension name="Subscription ID" />
            <dimension name="User ID" />
            <dimension name="Product ID" />
            <!-- Include the deployment ID as a custom dimension -->
            <dimension name="Deployment ID" value="@(context.Variables.GetValueOrDefault<string>("deploymentId", "unknown"))" />
        </azure-openai-emit-token-metric>

        <!-- Authenticate using Managed Identity -->
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        <base />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
```

#### Explanation

- **Set Backend Service**: The `<set-backend-service>` element directs the request to your Azure OpenAI endpoint.

  ```xml
  <set-backend-service id="apim-generated-policy" backend-id="azure-openai-openai-endpoint" />
  ```

- **Extract Deployment ID**: The `<set-variable>` element extracts the deployment ID from the request URL path. This is useful for tracking usage per model deployment.

  ```xml
  <set-variable name="deploymentId" value="@(context.Request.Url.Path.Split('/').ElementAtOrDefault(3))" />
  ```

- **Token Limit**: The `<azure-openai-token-limit>` policy limits the number of tokens that can be consumed per minute per subscription.

  ```xml
  <azure-openai-token-limit
      tokens-per-minute="10000000"
      counter-key="@(context.Subscription.Id)"
      estimate-prompt-tokens="true"
      tokens-consumed-header-name="consumed-tokens"
      remaining-tokens-header-name="remaining-tokens" />
  ```

  - `tokens-per-minute`: The maximum number of tokens allowed per minute. Adjust this value according to your cost management strategy.
  - `counter-key`: The key used to track the token count. Using `context.Subscription.Id` enforces the limit per subscription.
  - `estimate-prompt-tokens`: When set to `true`, includes an estimate of the prompt tokens in the token count.
  - `tokens-consumed-header-name` and `remaining-tokens-header-name`: Custom header names added to the response to report tokens consumed and tokens remaining.

- **Emit Token Metrics**: The `<azure-openai-emit-token-metric>` policy emits metrics for token usage, which can be used for monitoring and reporting.

  ```xml
  <azure-openai-emit-token-metric>
      <dimension name="API ID" />
      <dimension name="Subscription ID" />
      <dimension name="User ID" />
      <dimension name="Product ID" />
      <!-- Add the extracted deployment ID as a custom dimension -->
      <dimension name="Deployment ID" value="@(context.Variables.GetValueOrDefault<string>("deploymentId", "unknown"))" />
  </azure-openai-emit-token-metric>
  ```

  - Each `<dimension>` element adds a custom dimension to the emitted metric. Including `Deployment ID` helps in tracking usage per model deployment.

- **Authentication with Managed Identity**: The `<authentication-managed-identity>` policy uses the APIM instance's Managed Identity to authenticate with Azure Cognitive Services.

  ```xml
  <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
  ```

#### Steps to Implement

1. **Configure the Policy**: Add the above policy to your API in APIM under the inbound processing section.
2. **Adjust Token Limits**: Modify the `tokens-per-minute` value to set the desired token limit per subscription.
3. **Monitor Metrics**:
   - Use Azure Monitor or Application Insights to collect and analyze the emitted metrics.
   - Set up dashboards and alerts based on token consumption to proactively manage costs.
4. **Communicate Limits to Clients**:
   - Inform your API consumers about the token limits.
   - Clients can check the `consumed-tokens` and `remaining-tokens` headers in the response to monitor their usage, as shown in the sketch after this list.
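
For step 4, the sketch below shows how a client might read those headers. It is a minimal illustration, not part of the policy itself: the gateway URL, deployment name, and key are placeholders, and it assumes the Python `requests` package and APIM's default `Ocp-Apim-Subscription-Key` key header.

```python
import requests

# Placeholder values -- substitute your APIM gateway URL, deployment, and subscription key.
url = "https://contoso-apim.azure-api.net/openai/deployments/gpt-4o/chat/completions"
headers = {"Ocp-Apim-Subscription-Key": "<your-apim-subscription-key>"}
payload = {"messages": [{"role": "user", "content": "Hello"}]}

response = requests.post(url, params={"api-version": "2024-02-01"}, headers=headers, json=payload)

# Header names match those configured in the azure-openai-token-limit policy above.
print("consumed-tokens:", response.headers.get("consumed-tokens"))
print("remaining-tokens:", response.headers.get("remaining-tokens"))
```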

#### Benefits

- **Cost Control**: Limiting the number of tokens per subscription prevents excessive usage that could lead to unexpectedly high costs.
- **Transparency**: Emitting token metrics with custom dimensions allows for detailed usage analysis, enabling charge-back models or internal billing.
- **Scalability**: Token limits ensure that resources are fairly distributed among consumers, improving overall system performance.

#### Example Response Headers

When clients make requests, they can examine the response headers to see their token usage:

```yaml
consumed-tokens: 1500
remaining-tokens: 9850000
```

#### Handling Limit Exceeded Errors

If a client exceeds the token limit, APIM returns a **429 Too Many Requests** error. You can customize the error response in the `<on-error>` section to provide more context:

```xml
<on-error>
    <base />
    <choose>
        <when condition="@(context.Response.StatusCode == 429)">
            <return-response>
                <set-status code="429" reason="Too Many Requests" />
                <set-header name="Retry-After" exists-action="override">
                    <value>60</value>
                </set-header>
                <set-body>@{
                    return @"{
                        ""error"": {
                            ""code"": ""TooManyTokens"",
                            ""message"": ""Token limit exceeded. Please retry after some time.""
                        }
                    }";
                }</set-body>
            </return-response>
        </when>
    </choose>
</on-error>
```
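
On the client side, the simplest way to cope with this response is to honor the `Retry-After` header. A minimal Python sketch, assuming the `requests` package and the same placeholder URL, headers, and payload as the earlier example:

```python
import time

import requests

def post_with_retry(url, headers, payload, max_attempts=3):
    """POST through APIM, backing off when the token limit produces a 429."""
    for _ in range(max_attempts):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code != 429:
            return response
        # The on-error policy above sets Retry-After to 60 seconds.
        time.sleep(int(response.headers.get("Retry-After", "60")))
    raise RuntimeError("Token limit still exceeded after retries")
```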

#### Monitoring and Reporting

By emitting token metrics with custom dimensions, you can set up monitoring and reporting to track token consumption per subscription, deployment, and other dimensions. This can be achieved using:

- **Azure Monitor Metrics**: Collect and analyze the custom metrics emitted by APIM.
- **Log Analytics**: Aggregate logs and run queries to generate usage reports.
- **Alerts**: Configure alerts to notify you when token usage approaches limits, as in the sketch after this list.
- **Power BI**: Configure reports that connect to Log Analytics data sources.
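
As one example of the alerting path, a lightweight watcher can poll Log Analytics and warn when a subscription approaches its quota. This is a sketch under a few assumptions: the `azure-identity` and `azure-monitor-query` packages, a placeholder workspace ID and threshold, and that the APIM metrics land in the `customMetrics` table (a workspace-based Application Insights resource exposes them as `AppMetrics` instead).

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<your-log-analytics-workspace-id>"  # placeholder
HOURLY_WARN_THRESHOLD = 500_000  # example threshold, tune to your quota

# Sum tokens per subscription over the query timespan.
QUERY = """
customMetrics
| where name in ("Prompt Tokens", "Completion Tokens")
| extend subscriptionId = tostring(customDimensions['Subscription ID'])
| summarize totalTokens = sum(toreal(value)) by subscriptionId
"""

client = LogsQueryClient(DefaultAzureCredential())
result = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(hours=1))

# Assumes the query succeeded and returned a single table.
for row in result.tables[0].rows:
    subscription_id, total_tokens = row[0], row[1]
    if total_tokens > HOURLY_WARN_THRESHOLD:
        print(f"WARNING: subscription {subscription_id} used {total_tokens:.0f} tokens in the last hour")
```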

##### Log Analytics workspace via App Insights

![Log Analytics report on use](./images/log-analytics-use-report.png)

##### Power BI

![Power BI Report on Use](./images/power-bi-use-report.png)

#### Implementing Charge-back Models

With detailed metrics, you can implement charge-back models where internal teams or external customers are billed based on their actual usage. By tracking token consumption per subscription, you can allocate costs accurately.

### Example: Setting Up a Charge-back Report

Use this KQL query as a starting point: [AzureOpenAI-with-APIM/kql_queries/KQL-Token_Tracking_and_Cost.kql](https://github.com/microsoft/AzureOpenAI-with-APIM/blob/main/kql_queries/KQL-Token_Tracking_and_Cost.kql)

1. **Collect Metrics**: Ensure that the emitted metrics are being collected in Azure Monitor or Application Insights.
2. **Create a Log Analytics Workspace**: If you haven't already, create a Log Analytics workspace to store and query your metrics.
3. **Query Metrics**: Use Kusto Query Language (KQL) to query the metrics and aggregate token usage per subscription or deployment:
```kusto
customMetrics
| where name != "_APPRESOURCEPREVIEW_" // Exclude unwanted records
| where isnotempty(tostring(customDimensions['Deployment ID'])) // Only include records with a Deployment ID
| extend
    subscriptionId = tostring(customDimensions['Subscription ID']),
    deploymentId = tostring(customDimensions['Deployment ID']),
    tokens = toreal(value), // Extract the token count
    tokenType = case(
        name == "Prompt Tokens", "Prompt Tokens",
        name == "Completion Tokens", "Completion Tokens",
        "Other") // Identify token type
| where tokenType in ("Prompt Tokens", "Completion Tokens") // Filter to relevant token types
| extend
    // Calculate costs based on Deployment ID and token type, rounded to 3 decimal places
    promptTokenCost = round(case(
        deploymentId == "gpt-4o" and tokenType == "Prompt Tokens", tokens / 1000 * 0.03,
        deploymentId == "gpt-4o-global" and tokenType == "Prompt Tokens", tokens / 1000 * 0.04,
        deploymentId == "gpt-4" and tokenType == "Prompt Tokens", tokens / 1000 * 0.02,
        deploymentId == "gpt-35-turbo" and tokenType == "Prompt Tokens", tokens / 1000 * 0.0015,
        0.0), 3),
    completionTokenCost = round(case(
        deploymentId == "gpt-4o" and tokenType == "Completion Tokens", tokens / 1000 * 0.06,
        deploymentId == "gpt-4o-global" and tokenType == "Completion Tokens", tokens / 1000 * 0.07,
        deploymentId == "gpt-4" and tokenType == "Completion Tokens", tokens / 1000 * 0.05,
        deploymentId == "gpt-35-turbo" and tokenType == "Completion Tokens", tokens / 1000 * 0.002,
        0.0), 3)
| summarize
    totalPromptTokens = sumif(tokens, tokenType == "Prompt Tokens"),
    totalCompletionTokens = sumif(tokens, tokenType == "Completion Tokens"),
    totalPromptTokenCost = round(sumif(promptTokenCost, tokenType == "Prompt Tokens"), 2),
    totalCompletionTokenCost = round(sumif(completionTokenCost, tokenType == "Completion Tokens"), 2)
    by subscriptionId, deploymentId // Group by Subscription ID and Deployment ID
| extend
    totalCost = round(totalPromptTokenCost + totalCompletionTokenCost, 2) // Total cost, rounded to 2 decimal places
| order by totalCost desc // Sort by total cost
```

4. **Generate Reports**: Use Azure Dashboards or Power BI to visualize the data and create reports for charge-back.
5. **Automate Billing**: Export the reports or integrate with billing systems to automate the charge-back process. A sketch of this automation follows.
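
For steps 4 and 5, here is a minimal sketch of the export side, using the same `azure-monitor-query` pattern as the monitoring sketch above. The workspace ID and output file are placeholders, and the query is an abbreviated form of the charge-back KQL, without the cost calculation:

```python
import csv
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<your-log-analytics-workspace-id>"  # placeholder

# Abbreviated form of the charge-back query above: tokens per subscription and deployment.
QUERY = """
customMetrics
| where name in ("Prompt Tokens", "Completion Tokens")
| extend
    subscriptionId = tostring(customDimensions['Subscription ID']),
    deploymentId = tostring(customDimensions['Deployment ID'])
| summarize totalTokens = sum(toreal(value)) by subscriptionId, deploymentId, name
"""

client = LogsQueryClient(DefaultAzureCredential())
result = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=30))

# One CSV row per (subscription, deployment, token type), ready for a billing system.
table = result.tables[0]
with open("chargeback.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(table.columns)
    writer.writerows(table.rows)
```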

## Cost Forecasting

- Provide cost management per subscription