Commit 152ba07

Learn Editor: Update latency.md
1 parent 240adb8

2 files changed (+8, -6 lines)

articles/ai-services/openai/how-to/latency.md

Lines changed: 8 additions & 6 deletions

@@ -26,19 +26,21 @@ For a standard deployment, the quota assigned to your deployment partially deter

In a provisioned deployment, a set amount of model processing capacity is allocated to your endpoint. The amount of throughput that you can achieve on the endpoint is a function of the workload shape, including input token count, output token count, call rate, and cache match rate. The number of concurrent calls and total tokens processed can vary based on these values.
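
As a rough back-of-the-envelope illustration of how these workload shape factors combine, the sketch below multiplies call rate by per-call token counts. Every value, and the simple cache discount, is a hypothetical assumption for illustration rather than documented service behavior.

```python
# Back-of-the-envelope throughput estimate from workload shape.
# All numbers are hypothetical placeholders, not service limits.

calls_per_minute = 30       # observed or expected request rate
avg_input_tokens = 1_500    # average prompt size per call
avg_output_tokens = 250     # average completion size per call
cache_match_rate = 0.20     # assumed fraction of input tokens served from cache

# Discount cached input tokens. How cached tokens count against capacity
# depends on the deployment type, so treat this as a rough model only.
effective_input_tokens = avg_input_tokens * (1 - cache_match_rate)

tokens_per_minute = calls_per_minute * (effective_input_tokens + avg_output_tokens)
print(f"Estimated system throughput: {tokens_per_minute:,.0f} TPM")
```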

For all deployment types, understanding system level throughput is a key component of optimizing performance. It is important to consider system level throughput for a given model, version, and workload combination, as throughput will vary across these factors.

#### Estimating system level throughput

Understanding system level throughput for any workload involves multiple factors. At a high level, system level throughput is typically measured in tokens per minute (TPM). TPM data can be collected from Azure Monitor metrics, calculated using request-level token information, or estimated using common workload shapes.

##### Estimating TPM with Azure Monitor metrics

One approach to estimating system level throughput for a given workload is using historical token usage data. For Azure OpenAI workloads, all historical usage data can be accessed and visualized with the native Monitoring capabilities offered within Azure OpenAI. Two metrics are needed to estimate system level throughput for Azure OpenAI workloads: (1) **Processed Prompt Tokens** and (2) **Generated Completion Tokens**.

When combined, the **Processed Prompt Tokens** (input TPM) and **Generated Completion Tokens** (output TPM) metrics provide an estimated view of system level throughput based on actual workload traffic. This approach does not account for benefits from prompt caching, so it yields a conservative system throughput estimate. These metrics can be analyzed using minimum, average, and maximum aggregations over 1-minute windows; analyze the data across a multi-week time horizon to ensure there are enough data points to assess. The following screenshot shows an example of the **Processed Prompt Tokens** metric visualized in Azure Monitor, which is available directly through the Azure portal.

![User's image](media/latency/image.png)
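
The same two metrics can also be pulled programmatically instead of through the portal. The sketch below uses the `azure-monitor-query` Python SDK; the resource ID is a placeholder, and the metric names `ProcessedPromptTokens` and `GeneratedTokens` are assumptions about the metric IDs behind the display names used above, so confirm them for your resource before relying on this.

```python
# Sketch: estimate input and output TPM from Azure Monitor metrics.
# Requires: pip install azure-monitor-query azure-identity
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())

# Hypothetical Azure OpenAI resource ID.
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/"
    "providers/Microsoft.CognitiveServices/accounts/<aoai-resource>"
)

response = client.query_resource(
    resource_id,
    metric_names=["ProcessedPromptTokens", "GeneratedTokens"],  # assumed metric IDs
    timespan=timedelta(days=14),       # multi-week horizon
    granularity=timedelta(minutes=1),  # 1-minute windows
    aggregations=[MetricAggregationType.TOTAL],
)

# Compute min/avg/max tokens-per-minute across all 1-minute windows.
for metric in response.metrics:
    totals = [
        point.total
        for series in metric.timeseries
        for point in series.data
        if point.total is not None
    ]
    if totals:
        print(
            f"{metric.name}: min={min(totals):,.0f} "
            f"avg={sum(totals) / len(totals):,.0f} "
            f"max={max(totals):,.0f} tokens/min"
        )
```

Summing the per-minute totals of the two metrics gives the combined input-plus-output view of system throughput described above.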

##### Estimating TPM from request data

A second approach to estimating system level throughput involves collecting token usage information from API request data. This method provides a more granular approach to understanding workload shape per request. Combining per-request token usage information with request volume, measured in requests per minute (RPM), provides an estimate for system level throughput. It is important to note that any assumptions made about the consistency of token usage information across requests and request volume will impact the system throughput estimate.
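
As a minimal sketch of this calculation, suppose the `usage` block (`prompt_tokens`, `completion_tokens`) returned with each chat completions response is logged; the sample values below are hypothetical.

```python
# Sketch: estimate TPM from request-level token usage.
# `requests_last_minute` stands in for whatever log or telemetry store
# captures the `usage` block of each chat completions response.
requests_last_minute = [
    {"prompt_tokens": 1200, "completion_tokens": 180},  # hypothetical samples
    {"prompt_tokens": 1450, "completion_tokens": 210},
    {"prompt_tokens": 990, "completion_tokens": 160},
]

rpm = len(requests_last_minute)  # requests per minute
avg_tokens_per_request = sum(
    r["prompt_tokens"] + r["completion_tokens"] for r in requests_last_minute
) / rpm

# The estimate assumes this minute is representative of the workload;
# variation in token usage or request volume shifts the result.
estimated_tpm = rpm * avg_tokens_per_request
print(f"RPM={rpm}, avg tokens/request={avg_tokens_per_request:.0f}, "
      f"estimated TPM={estimated_tpm:,.0f}")
```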

##### Estimating TPM from common workload shapes

Second file changed: media/latency/image.png (105 KB)
