- Provide cost management per subscription
To effectively manage the costs associated with Azure OpenAI Service usage, you can implement policies in Azure API Management (APIM) that control and monitor the number of tokens consumed by each request. By limiting tokens per subscription and emitting detailed metrics, you can enforce usage quotas, prevent overuse, and enable charge-back models for cost recovery.
### Implementing Token Limits and Metrics Emission

The following APIM policy helps you manage costs by:

- Limiting the number of tokens a subscription can consume per minute.
- Estimating prompt tokens so that both prompt and completion tokens count toward the limit.
- Emitting token usage metrics with dimensions that help you analyze and report on token consumption per deployment and subscription.

#### APIM Policy Configuration
```xml
<policies>
    <inbound>
        <!-- Set the backend service to your Azure OpenAI endpoint -->
        <!-- ... -->
    </inbound>
    <!-- ... -->
</policies>
```

- **Extract Deployment ID**: The `<set-variable>` element extracts the deployment ID from the request URL path. This is useful for tracking usage per model deployment.
- `tokens-per-minute`: The maximum number of tokens allowed per minute. Adjust this value according to your cost management strategy.
- `counter-key`: The key used to track the token count. Using `context.Subscription.Id` enforces the limit per subscription.
- `estimate-prompt-tokens`: When set to `true`, includes an estimate of the prompt tokens in the token count.
- `tokens-consumed-header-name` and `remaining-tokens-header-name`: Custom header names to include in the response, indicating tokens consumed and remaining.
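Putting the pieces described above together, the inbound section can be sketched roughly as follows. This is a minimal sketch, not the repository's exact policy: the backend URL, the `10000` limit, and the `deploymentId` variable name are placeholders, and the path parameter name is assumed to be `deployment-id`.

```xml
<!-- Sketch only: URL, limit value, and variable/parameter names are placeholders -->
<set-backend-service base-url="https://YOUR-RESOURCE.openai.azure.com/openai" />
<!-- Capture the deployment ID from the request URL path for per-deployment tracking -->
<set-variable name="deploymentId" value="@(context.Request.MatchedParameters["deployment-id"])" />
<!-- Enforce a per-subscription token quota, counting estimated prompt tokens too -->
<azure-openai-token-limit
    tokens-per-minute="10000"
    counter-key="@(context.Subscription.Id)"
    estimate-prompt-tokens="true"
    tokens-consumed-header-name="consumed-tokens"
    remaining-tokens-header-name="remaining-tokens" />
<!-- Authenticate to the backend with the APIM instance's Managed Identity -->
<authentication-managed-identity resource="https://cognitiveservices.azure.com" />
```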
- **Emit Token Metrics**: The `<azure-openai-emit-token-metric>` policy emits metrics for token usage, which can be used for monitoring and reporting.

```xml
<azure-openai-emit-token-metric>
    <dimension name="API ID" />
    <dimension name="Subscription ID" />
    <dimension name="User ID" />
    <dimension name="Product ID" />
    <!-- Add the extracted deployment ID as a custom dimension -->
    <dimension name="Deployment ID" value="@((string)context.Variables["deploymentId"])" />
</azure-openai-emit-token-metric>
```

- Each `<dimension>` element adds a custom dimension to the emitted metric. Including `Deployment ID` helps in tracking usage per model deployment.
- **Authentication with Managed Identity**: The `<authentication-managed-identity>` policy uses Managed Identity to authenticate with Azure Cognitive Services.

To use this policy:

1. **Configure the Policy**: Add the above policy to your API in APIM under the inbound processing section.
2. **Adjust Token Limits**: Modify the `tokens-per-minute` value to set the desired token limit per subscription.
3. **Monitor Metrics**:
   - Use Azure Monitor or Application Insights to collect and analyze the emitted metrics.
   - Set up dashboards and alerts based on token consumption to proactively manage costs.
4. **Communicate Limits to Clients**:
   - Inform your API consumers about the token limits.
   - Clients can check the `consumed-tokens` and `remaining-tokens` headers in the response to monitor their usage.
#### Benefits

- **Cost Control**: By limiting the number of tokens per subscription, you prevent excessive usage that could lead to unexpectedly high costs.
- **Transparency**: Emitting token metrics with custom dimensions allows for detailed usage analysis, enabling charge-back models or internal billing.
- **Scalability**: Implementing token limits ensures that resources are fairly distributed among consumers, improving overall system performance.
#### Example Response Headers

When clients make requests, they can examine the response headers to see their token usage:

```yaml
consumed-tokens: 1500
remaining-tokens: 9850000
```
#### Handling Limit Exceeded Errors

If a client exceeds the token limit, APIM will return a **429 Too Many Requests** error. You can customize the error response using APIM policies to provide more context. For example, an `<on-error>` section can return a custom JSON body:

```xml
<on-error>
    <choose>
        <!-- Condition and status handling shown here are illustrative -->
        <when condition="@(context.Response.StatusCode == 429)">
            <return-response>
                <set-status code="429" reason="Too Many Requests" />
                <set-body>@{
                    return @"{
                        ""error"": {
                            ""message"": ""Token limit exceeded. Please retry after some time.""
                        }
                    }";
                }</set-body>
            </return-response>
        </when>
    </choose>
</on-error>
```
#### Monitoring and Reporting

By emitting token metrics with custom dimensions, you can set up monitoring and reporting to track token consumption per subscription, deployment, and other dimensions. This can be achieved using:

- **Azure Monitor Metrics**: Collect and analyze the custom metrics emitted by APIM.
- **Log Analytics**: Aggregate logs and perform queries to generate usage reports.
- **Alerts**: Configure alerts to notify you when token usage approaches limits.
- **Power BI**: Build reports that connect to Log Analytics data sources.
##### Log Analytics workspace via App Insights

![kql-full-query](/src/kql-full-query.png)

##### Power BI

![powerbi-sample](/src/powerbi-sample.png)
#### Implementing Charge-back Models

With detailed metrics, you can implement charge-back models where internal teams or external customers are billed based on their actual usage. By tracking token consumption per subscription, you can allocate costs accurately.
### Example: Setting Up a Charge-back Report

Use this KQL query: [AzureOpenAI-with-APIM/kql_queries/KQL-Token_Tracking_and_Cost.kql at main · microsoft/AzureOpenAI-with-APIM](https://github.com/microsoft/AzureOpenAI-with-APIM/blob/main/kql_queries/KQL-Token_Tracking_and_Cost.kql)

1. **Collect Metrics**: Ensure that the emitted metrics are being collected in Azure Monitor or Application Insights.
2. **Create a Log Analytics Workspace**: If you haven't already, create a Log Analytics workspace to store and query your metrics.
3. **Query Metrics**: Use Kusto Query Language (KQL) to query the metrics and aggregate token usage per subscription or deployment.
```kusto
customMetrics
| where name != "_APPRESOURCEPREVIEW_" // Exclude unwanted records
| where isnotempty(tostring(customDimensions['Deployment ID'])) // Only include records with a Deployment ID
```
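As a rough sketch of the aggregation step, the filtered records could be summed per subscription and deployment as below. The `valueSum` column name is an assumption about how Application Insights stores the emitted metric; refer to the linked KQL file for the authoritative query.

```kusto
// Sketch only: valueSum and dimension names are assumptions
customMetrics
| where isnotempty(tostring(customDimensions['Deployment ID']))
| summarize TotalTokens = sum(valueSum)
    by SubscriptionId = tostring(customDimensions['Subscription ID']),
       DeploymentId = tostring(customDimensions['Deployment ID'])
| order by TotalTokens desc
```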