articles/ai-services/openai/concepts/provisioned-throughput.md (7 additions, 7 deletions)
@@ -25,24 +25,24 @@ The provisioned throughput capability allows you to specify the amount of throug
An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model. A deployment provides customer access to a model for inference and integrates more features like Content Moderation ([See content moderation documentation](content-filter.md)).
> [!NOTE]
- > Provisioned throughput unit(PTU) quota is different from standard quota in Azure OpenAI and are not available by default. To learn more about this offering contact your Microsoft Account Team.
+ > Provisioned throughput unit (PTU) quota is different from standard quota in Azure OpenAI and is not available by default. To learn more about this offering, contact your Microsoft Account Team.
## What do you get?
| Topic | Provisioned|
|---|---|
- | What is it? | Provides guaranteed throughput at smaller increments than the existing provisioned offer. Deployments have a consistent max latency for a given model-version |
+ | What is it? | Provides guaranteed throughput at smaller increments than the existing provisioned offer. Deployments have a consistent max latency for a given model-version.|
| Who is it for? | Customers who want guaranteed throughput with minimal latency variance. |
- | Quota | Provisioned-managed throughput Units for a given model |
+ | Quota | Provisioned-managed throughput Units for a given model.|
| Latency | Max latency constrained from the model. Overall latency is a factor of call shape. |
- | Estimating size | Provided calculator in the studio & benchmarking script |
+ | Utilization | Provisioned-managed Utilization measure provided in Azure Monitor.|
+ | Estimating size | Provided calculator in the studio & benchmarking script.|
## Key concepts
### Provisioned throughput units
- Provisioned throughput Units (PTU) are units of model processing capacity that customers you can reserve and deploy for processing prompts and generating completions. The minimum PTU deployment, increments, and processing capacity associated with each unit varies by model type & version.
+ Provisioned throughput units (PTU) are units of model processing capacity that you can reserve and deploy for processing prompts and generating completions. The minimum PTU deployment, increments, and processing capacity associated with each unit vary by model type & version.
### Deployment types
@@ -57,7 +57,7 @@ az cognitiveservices account deployment create \
articles/ai-services/openai/how-to/latency.md (9 additions, 9 deletions)
@@ -66,9 +66,9 @@ o Generate fewer responses: The best_of & n parameters can greatly increase late
In summary, reducing the number of tokens generated per request reduces the latency of each request.
### Streaming
- Setting `stream: true` in a request makes the service return tokens as soon as they're available, instead of waiting for the full sequence of tokens to be generated. It doesn't change the time to get all the tokens, but it reduces the time for first response. This aproach provides a better user experience since end-suers can read the response as it is generated.
+ Setting `stream: true` in a request makes the service return tokens as soon as they're available, instead of waiting for the full sequence of tokens to be generated. It doesn't change the time to get all the tokens, but it reduces the time for first response. This approach provides a better user experience since end-users can read the response as it is generated.
- Streaming is also valuable with large calls that take a long time to process. Many clients and intermediary layers have timeouts on individual calls. Long generation calls might be cancelled due to client-side time outs. By streaming the data back, you can ensure incremental data is received.
+ Streaming is also valuable with large calls that take a long time to process. Many clients and intermediary layers have timeouts on individual calls. Long generation calls might be canceled due to client-side timeouts. By streaming the data back, you can ensure incremental data is received.
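As a quick illustration (not part of the article being changed), a minimal streaming sketch with the OpenAI Python library might look like the following; the library version (1.x), endpoint, key, API version, and the deployment name `gpt-4` are all assumptions or placeholders.

```python
from openai import AzureOpenAI

# Placeholders: replace with your own resource endpoint, key, and API version.
client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
)

# stream=True returns chunks as tokens become available instead of one final response.
response = client.chat.completions.create(
    model="gpt-4",  # assumed deployment name
    messages=[{"role": "user", "content": "Write a short poem about latency."}],
    stream=True,
)

for chunk in response:
    # Each chunk may carry a small piece of the generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```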
@@ -82,7 +82,7 @@ Streaming impacts perceived latency. With streaming enabled you receive tokens b
Sentiment analysis, language translation, content generation.
- There are many use cases where you are performing some bulk task where you only care about the finished result, not the real-time response. If streaming is disabled, you won't receive any tokens until the model has finished the entire response.
+ There are many use cases where you're performing some bulk task where you only care about the finished result, not the real-time response. If streaming is disabled, you won't receive any tokens until the model has finished the entire response.
### Content filtering
@@ -96,23 +96,23 @@ Learn more about requesting modifications to the default, [content filtering pol
### Separation of workloads
- Mixing different workloads on the same endpoint can negatively affect latency. This is because (1) they are batched together during inference and short calls can be waiting for longer completions and (2) mixing the calls can reduce your cache hit rate as they are both competing for the same space. When possible, it is recommended to have separate deployments for each workload.
+ Mixing different workloads on the same endpoint can negatively affect latency. This is because (1) they're batched together during inference and short calls can be waiting for longer completions and (2) mixing the calls can reduce your cache hit rate as they're both competing for the same space. When possible, it's recommended to have separate deployments for each workload.
### Prompt Size
- While prompt size has smaller affect on latency than the generation size it will affect the overall time, especially when the size grows large.
+ While prompt size has a smaller influence on latency than the generation size, it affects the overall time, especially when the size grows large.
### Batching
- If you are sending multiple requests to the same endpoint, you can batch the requests into a single call. This will reduce the number of requests you need to make and depending on the scenario it might improve overall response time. We recommend testing this method to see if it helps.
+ If you're sending multiple requests to the same endpoint, you can batch the requests into a single call. This reduces the number of requests you need to make and depending on the scenario it might improve overall response time. We recommend testing this method to see if it helps.
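For illustration only: the legacy Completions endpoint accepts a list of prompts, so several small requests can travel in one call. The sketch below assumes the OpenAI Python library 1.x and a completions-capable deployment named `gpt-35-turbo-instruct`; names and versions are placeholders.

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
)

# One API call carrying several prompts instead of several calls with one prompt each.
prompts = [
    "Summarize: The meeting moved to Tuesday.",
    "Summarize: Shipping is delayed by two days.",
    "Summarize: The invoice was paid in full.",
]

result = client.completions.create(
    model="gpt-35-turbo-instruct",  # assumed completions-capable deployment name
    prompt=prompts,
    max_tokens=50,
)

# Each returned choice carries an index that maps it back to its prompt.
for choice in sorted(result.choices, key=lambda c: c.index):
    print(choice.index, choice.text.strip())
```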
## How to measure your throughput
We recommend measuring your overall throughput on a deployment with two measures:
- - Calls per minute: The number of API inference calls you are making per minute. This can be measured in Azure-monitor using the Azure OpenAI Requests metric and splitting by the ModelDeploymentName
+ - Calls per minute: The number of API inference calls you're making per minute. This can be measured in Azure-monitor using the Azure OpenAI Requests metric and splitting by the ModelDeploymentName
- Total Tokens per minute: The total number of tokens being processed per minute by your deployment. This includes prompt & generated tokens. This is often further split into measuring both for a deeper understanding of deployment performance. This can be measured in Azure-Monitor using the Processed Inference tokens metric.
You can learn more about [Monitoring the Azure OpenAI Service](./monitoring.md).
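Azure Monitor is the authoritative source for both measures. As a rough client-side cross-check (an illustration, not from the article), you can tally calls and the `usage.total_tokens` field returned by non-streaming responses; the client settings and deployment name below are placeholders.

```python
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
)

calls = 0
tokens = 0
start = time.monotonic()

for question in ["What is 2+2?", "Name a prime number.", "Spell 'latency'."]:
    response = client.chat.completions.create(
        model="gpt-4",  # assumed deployment name
        messages=[{"role": "user", "content": question}],
    )
    calls += 1
    tokens += response.usage.total_tokens  # prompt + generated tokens

elapsed_minutes = (time.monotonic() - start) / 60
print(f"Calls per minute: {calls / elapsed_minutes:.1f}")
print(f"Total tokens per minute: {tokens / elapsed_minutes:.1f}")
```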
## How to measure per-call latency
- The time it takes for each call depends on how long it takes to read the model, generate the output, and apply content filters. The way you measure the time will vary if you are using streaming or not. We suggest a different set of measures for each case.
+ The time it takes for each call depends on how long it takes to read the model, generate the output, and apply content filters. The way you measure the time will vary depending on whether you're using streaming. We suggest a different set of measures for each case.
You can learn more about [Monitoring the Azure OpenAI Service](./monitoring.md).
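One common client-side approach for streaming calls is to record the time to the first token and the time from first to last token divided by the number of generated tokens. A sketch, assuming the OpenAI Python library 1.x with placeholder endpoint and deployment names:

```python
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
)

start = time.monotonic()
first_token_time = None
generated_chunks = 0

stream = client.chat.completions.create(
    model="gpt-4",  # assumed deployment name
    messages=[{"role": "user", "content": "Explain streaming in one paragraph."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.monotonic()
        generated_chunks += 1  # rough proxy: one content chunk is roughly one token

end = time.monotonic()
print(f"Time to first token: {first_token_time - start:.2f} s")
if generated_chunks > 1:
    print(f"Seconds per generated token (approx.): {(end - first_token_time) / generated_chunks:.3f}")
```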
@@ -128,7 +128,7 @@ Time from the first token to the last token, divided by the number of generated
## Summary
- * **Model latency**: If model latency is important to you we recommend trying out our latest models in the [GPT-3.5 Turbo model series](../concepts/models.md).
+ * **Model latency**: If model latency is important to you, we recommend trying out our latest models in the [GPT-3.5 Turbo model series](../concepts/models.md).
* **Lower max tokens**: OpenAI has found that even in cases where the total number of tokens generated is similar the request with the higher value set for the max token parameter will have more latency.
articles/ai-services/openai/how-to/provisioned-get-started.md (5 additions, 5 deletions)
@@ -46,7 +46,7 @@ After you purchase a commitment on your quota, you can create a deployment. To c
| Select a model| Choose the specific model you wish to deploy. | GPT-4 |
| Model version | Choose the version of the model to deploy. | 0613 |
| Deployment Name | The deployment name is used in your code to call the model by using the client libraries and the REST APIs. | gpt-4|
- | Content filter | Specify the filtering policy to apply to the deployment. Learn more on our [Content Filtering](../concepts/content-filter.md) how-tow| Default |
+ | Content filter | Specify the filtering policy to apply to the deployment. Learn more on our [Content Filtering](../concepts/content-filter.md) how-to.| Default |
| Deployment Type |This impacts the throughput and performance. Choose Provisioned-Managed for your provisioned deployment | Provisioned-Managed |
| Provisioned Throughput Units | Choose the amount of throughput you wish to include in the deployment. | 100 |
@@ -62,13 +62,13 @@ az cognitiveservices account deployment create \
--model-version 0613 \
--model-format OpenAI \
--sku-capacity 100 \
- --sku-name Provisioned-Managed
+ --sku-name ProvisionedManaged
```
- REST, ARM template, Bicep and Terraform can also be used to create deployments. See the section on automating deployments in the [Managing Quota](https://learn.microsoft.com/azure/ai-services/openai/how-to/quota?tabs=rest#automate-deployment) how-to guide and replace the `sku.name` with "Provisioned-Managed" rather than "Standard."
+ REST, ARM template, Bicep and Terraform can also be used to create deployments. See the section on automating deployments in the [Managing Quota](https://learn.microsoft.com/azure/ai-services/openai/how-to/quota?tabs=rest#automate-deployment) how-to guide and replace the `sku.name` with "ProvisionedManaged" rather than "Standard."
## Make your first calls
- The inferencing code for provisioned deployments is the same a standard deployment type. The following code snippet shows a chat completions call to a GPT-4 model. For your first time using these models programmatically, we recommend starting with our [quickstart start guide](../quickstart.md). Our recommendation is to use the OpenAI library with version 1.0 or greater since this includes retry logic within the library.
+ The inferencing code for provisioned deployments is the same as for a standard deployment type. The following code snippet shows a chat completions call to a GPT-4 model. For your first time using these models programmatically, we recommend starting with our [quickstart guide](../quickstart.md). Our recommendation is to use the OpenAI library with version 1.0 or greater since this includes retry logic within the library.
```python
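# Illustrative sketch only; the article's original snippet is collapsed in this diff view.
# Assumptions: OpenAI Python library >= 1.0; the endpoint, key, API version, and the
# deployment name "gpt-4" are placeholders to replace with your own values.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
)

response = client.chat.completions.create(
    model="gpt-4",  # the deployment name you chose when deploying the model
    messages=[{"role": "user", "content": "Tell me something about provisioned throughput."}],
)

print(response.choices[0].message.content)
```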
@@ -127,7 +127,7 @@ A 429 response indicates that the allocated PTUs are fully consumed at the time
The 429 signal isn't an unexpected error response when pushing to high utilization but instead part of the design for managing queuing and high load for provisioned deployments.
### Modifying retry logic within the client libraries
- The Azure OpenAI SDKs retry 429 responses by default and behind the scenes in the client (up to the maximum retries). The libraries respect the `retry-after` time. You can also modify the retry behavior to better suite your experience. Here's an example with the python library.
+ The Azure OpenAI SDKs retry 429 responses by default and behind the scenes in the client (up to the maximum retries). The libraries respect the `retry-after` time. You can also modify the retry behavior to better suit your experience. Here's an example with the Python library.
You can use the `max_retries` option to configure or disable retry settings:
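A minimal sketch of both options with the OpenAI Python library (1.x assumed; endpoint, key, API version, and deployment name are placeholders):

```python
from openai import AzureOpenAI

# max_retries on the client applies to every request made through it;
# setting it to 0 disables the SDK's automatic retries.
client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
    max_retries=5,
)

# with_options overrides the retry count for a single call.
response = client.with_options(max_retries=0).chat.completions.create(
    model="gpt-4",  # assumed deployment name
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```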
0 commit comments