Commit 7329554

Learn Editor: Update latency.md
1 parent 152ba07 commit 7329554


articles/ai-services/openai/how-to/latency.md

Lines changed: 18 additions & 2 deletions
@@ -36,11 +36,27 @@ One approach to estimating system level throughput for a given workload is using

When combined, the **Processed Prompt Tokens** (input TPM) and **Generated Completion Tokens** (output TPM) metrics provide an estimated view of system level throughput based on actual workload traffic. This approach does not account for benefits from prompt caching, so it is a conservative estimate of system throughput. These metrics can be analyzed using minimum, average, and maximum aggregation over 1-minute windows; analyzing the data across a multi-week time horizon ensures there are enough data points to assess. The following screenshot shows an example of the **Processed Prompt Tokens** metric visualized in Azure Monitor, which is available directly through the Azure portal.

-![User's image](media/latency/image.png)
+![Azure Monitor chart with processed prompt tokens metric line graph.](media/latency/image.png)

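The same 1-minute aggregates can also be pulled programmatically. The following is a minimal sketch using the `azure-monitor-query` Python package; the resource ID is a placeholder, and the metric name strings are assumptions that should be confirmed against the metric definitions for your Azure OpenAI resource (the portal display names are **Processed Prompt Tokens** and **Generated Completion Tokens**).

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

# Placeholder resource ID for the Azure OpenAI resource being measured.
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.CognitiveServices/accounts/<aoai-resource>"
)

client = MetricsQueryClient(DefaultAzureCredential())

# Minimum, average, and maximum aggregation over 1-minute windows across a multi-week horizon.
response = client.query_resource(
    resource_id,
    metric_names=["ProcessedPromptTokens", "GeneratedTokens"],  # assumed metric names; confirm for your resource
    timespan=timedelta(weeks=3),
    granularity=timedelta(minutes=1),
    aggregations=[
        MetricAggregationType.MINIMUM,
        MetricAggregationType.AVERAGE,
        MetricAggregationType.MAXIMUM,
    ],
)

# Print one row per 1-minute window for each metric.
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.minimum, point.average, point.maximum)
```
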
##### Estimating TPM from request data

-A second approach to estimated system level throughput involves collecting token usage information from API request data. This method provides a more granular approach to understanding workload shape per request. Combining per request token usage information with request volume, measured in requests per minute (RPM), provides an estimate for system level throughput. It is important to note that any assumptions made for consistency of token usage information across requests and request volume will impact the system throughput estimate. The following
+A second approach to estimating system level throughput involves collecting token usage information from API request data. This method provides a more granular view of workload shape per request. Combining per-request token usage information with request volume, measured in requests per minute (RPM), provides an estimate for system level throughput. It is important to note that any assumptions made about the consistency of token usage across requests and about request volume will affect the system throughput estimate. The token usage output can be found in the API response details for a given Azure OpenAI Service chat completions request.

```json
{
    "body": {
        "id": "chatcmpl-7R1nGnsXO8n4oi9UPz2f3UHdgAYMn",
        "created": 1686676106,
        "choices": [...],
        "usage": {
            "completion_tokens": 557,
            "prompt_tokens": 33,
            "total_tokens": 590
        }
    }
}
```

Assuming all requests for a given workload are uniform, the prompt tokens and completion tokens can each be multiplied by the estimated RPM to identify the input and output TPM for the given workload.
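
As a quick sketch of that arithmetic, using the usage values from the example response above and an assumed request volume of 15 RPM:

```python
# Token usage taken from the example chat completions response above.
prompt_tokens_per_request = 33
completion_tokens_per_request = 557

estimated_rpm = 15  # assumed request volume for the workload

input_tpm = prompt_tokens_per_request * estimated_rpm        # 33 * 15 = 495
output_tpm = completion_tokens_per_request * estimated_rpm   # 557 * 15 = 8,355

print(f"Estimated input TPM: {input_tpm}")
print(f"Estimated output TPM: {output_tpm}")
```
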

##### Estimating TPM from common workload shapes
