articles/ai-services/openai/how-to/latency.md (+18 −4)
@@ -17,18 +17,32 @@ ms.custom:
This article provides background on how latency and throughput work with Azure OpenAI and how to optimize your environment to improve performance.
## Understanding throughput vs latency
- There are two key concepts to think about when sizing an application: (1) System level throughput and (2) Per-call response times (also known as Latency).
+ There are two key concepts to think about when sizing an application: (1) system level throughput, measured in tokens per minute (TPM), and (2) per-call response times (also known as latency).
### System level throughput
This looks at the overall capacity of your deployment: how many requests per minute and how many total tokens can be processed.
For a standard deployment, the quota assigned to your deployment partially determines the amount of throughput you can achieve. However, quota only determines the admission logic for calls to the deployment and doesn't directly enforce throughput. Due to per-call latency variations, you might not be able to achieve throughput as high as your quota. [Learn more on managing quota](./quota.md).
- In a provisioned deployment, A set amount of model processing capacity is allocated to your endpoint. The amount of throughput that you can achieve on the endpoint is a factor of the input size, output size, call rate and cache match rate. The number of concurrent calls and total tokens processed can vary based on these values. The following steps walk through how to assess the throughput you can get a given workload in a provisioned deployment:
+ In a provisioned deployment, a set amount of model processing capacity is allocated to your endpoint. The amount of throughput that you can achieve on the endpoint is a function of the workload shape, including input token count, output token count, call rate, and cache match rate. The number of concurrent calls and total tokens processed can vary based on these values.
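
As a rough illustration of how workload shape drives provisioned throughput, the sketch below discounts input tokens by an assumed cache match rate. The formula and the numbers are illustrative assumptions for building intuition, not the service's actual capacity model.

```python
# Illustrative only: how call rate, token counts, and cache match rate
# combine into a rough tokens-per-minute figure. This simplification is
# not the service's actual provisioned-capacity model.
def effective_tpm(calls_per_minute: float, input_tokens: float,
                  output_tokens: float, cache_match_rate: float) -> float:
    uncached_input = input_tokens * (1.0 - cache_match_rate)
    return calls_per_minute * (uncached_input + output_tokens)

# Same traffic, higher cache match rate -> less capacity consumed.
print(effective_tpm(60, 1000, 200, cache_match_rate=0.0))  # 72000.0
print(effective_tpm(60, 1000, 200, cache_match_rate=0.5))  # 42000.0
```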

- 1. Use the Capacity calculator for a sizing estimate.
+ For all deployment types, system level throughput is a key component of performance. The following section explains several approaches that can be used to estimate system level throughput with existing metrics and data from your Azure OpenAI Service environment.

- 2. Benchmark the load using real traffic workload. Measure the utilization & tokens processed metrics from Azure Monitor. Run for an extended period. The [Azure OpenAI Benchmarking repository](https://aka.ms/aoai/benchmarking) contains code for running the benchmark. Finally, the most accurate approach is to run a test with your own data and workload characteristics.
+ #### Estimating system level throughput
+
+ Understanding system level throughput for any workload involves multiple factors. At a high level, system level throughput is typically measured in tokens per minute (TPM). TPM data can be collected from Azure Monitor metrics, calculated using request-level token information, or estimated using common workload shapes.
+
+ ##### Determining TPM from Azure Monitor metrics
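
As one way to collect this data, the sketch below pulls per-minute token totals with the `azure-monitor-query` client library. The metric names and the resource ID format are assumptions for illustration; verify them against the metrics exposed by your own Azure OpenAI resource.

```python
# A sketch of retrieving per-minute token totals from Azure Monitor.
# The metric names below are assumptions; check the metrics blade on
# your own Azure OpenAI resource for the exact names it exposes.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.CognitiveServices/accounts/<aoai-resource>"
)

client = MetricsQueryClient(DefaultAzureCredential())

response = client.query_resource(
    resource_id,
    metric_names=["ProcessedPromptTokens", "GeneratedTokens"],  # assumed names
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),
    aggregations=[MetricAggregationType.TOTAL],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.total is not None:
                print(f"{metric.name} {point.timestamp:%H:%M}: {point.total:.0f}")
```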
+
+ ##### Calculating TPM from request data
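
If you log the token `usage` returned with each non-streaming API response, you can bucket those counts into one-minute windows. A minimal sketch, assuming request records you've collected yourself:

```python
# A sketch of computing TPM from logged request data. Assumes you've
# recorded each response's timestamp plus its prompt and completion
# token counts (the `usage` object returned for non-streaming calls).
from collections import defaultdict
from datetime import datetime

# (timestamp, prompt_tokens, completion_tokens) records from your own logs.
requests = [
    (datetime(2024, 5, 1, 12, 0, 5), 950, 210),
    (datetime(2024, 5, 1, 12, 0, 41), 1020, 180),
    (datetime(2024, 5, 1, 12, 1, 13), 880, 240),
]

tokens_per_minute = defaultdict(int)
for ts, prompt_tokens, completion_tokens in requests:
    minute = ts.replace(second=0, microsecond=0)
    tokens_per_minute[minute] += prompt_tokens + completion_tokens

for minute, tpm in sorted(tokens_per_minute.items()):
    print(f"{minute:%H:%M} -> {tpm} TPM")
```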
+
+ ##### Estimating TPM from common workload shapes
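
Before any traffic exists, TPM can be approximated from an assumed workload shape: expected requests per minute multiplied by average input and output tokens per request. The shapes below are illustrative assumptions, not published sizing guidance.

```python
# Back-of-the-envelope TPM from assumed workload shapes.
# All request rates and token counts are illustrative assumptions.
workloads = {
    "chat": {"rpm": 30, "input_tokens": 1000, "output_tokens": 200},
    "summarization": {"rpm": 10, "input_tokens": 4000, "output_tokens": 500},
}

for name, shape in workloads.items():
    tpm = shape["rpm"] * (shape["input_tokens"] + shape["output_tokens"])
    print(f"{name}: ~{tpm:,} TPM")
# chat: ~36,000 TPM; summarization: ~45,000 TPM
```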
+
+ There are two approaches that can be used to estimate the amount of model processing capacity needed to support a given workload:
+
+ 1. Use the built-in capacity calculator in the Azure OpenAI deployment creation workflow in Azure AI Studio.
+ 1. Use the expanded Azure OpenAI capacity calculator in Azure AI Studio.