
Commit 77e03d7

small fixes
1 parent c04289f commit 77e03d7

5 files changed (+39, -39 lines changed)


articles/ai-services/openai/concepts/provisioned-throughput.md

Lines changed: 7 additions & 7 deletions
@@ -25,24 +25,24 @@ The provisioned throughput capability allows you to specify the amount of throug
An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model. A deployment provides customer access to a model for inference and integrates more features like Content Moderation ([See content moderation documentation](content-filter.md)).

> [!NOTE]
- > Provisioned throughput unit(PTU) quota is different from standard quota in Azure OpenAI and are not available by default. To learn more about this offering contact your Microsoft Account Team.
+ > Provisioned throughput unit (PTU) quota is different from standard quota in Azure OpenAI and is not available by default. To learn more about this offering contact your Microsoft Account Team.

## What do you get?

| Topic | Provisioned|
|---|---|
- | What is it? | Provides guaranteed throughput at smaller increments than the existing provisioned offer. Deployments have a consistent max latency for a given model-version |
+ | What is it? | Provides guaranteed throughput at smaller increments than the existing provisioned offer. Deployments have a consistent max latency for a given model-version. |
| Who is it for? | Customers who want guaranteed throughput with minimal latency variance. |
- | Quota | Provisioned-managed throughput Units for a given model |
+ | Quota | Provisioned-managed throughput Units for a given model. |
| Latency | Max latency constrained from the model. Overall latency is a factor of call shape. |
- | Utilization | Provisioned-managed Utilization measure provided in Azure Monitor |
- | Estimating size | Provided calculator in the studio & benchmarking script |
+ | Utilization | Provisioned-managed Utilization measure provided in Azure Monitor. |
+ | Estimating size | Provided calculator in the studio & benchmarking script. |

## Key concepts

### Provisioned throughput units

- Provisioned throughput Units (PTU) are units of model processing capacity that customers you can reserve and deploy for processing prompts and generating completions. The minimum PTU deployment, increments, and processing capacity associated with each unit varies by model type & version.
+ Provisioned throughput units (PTU) are units of model processing capacity that customers you can reserve and deploy for processing prompts and generating completions. The minimum PTU deployment, increments, and processing capacity associated with each unit varies by model type & version.

### Deployment types

@@ -57,7 +57,7 @@ az cognitiveservices account deployment create \
--model-version 0613 \
--model-format OpenAI \
--sku-capacity 100 \
- --sku-name Provisioned-Managed
+ --sku-name ProvisionedManaged
```

### Quota
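For reference, the same Provisioned-Managed deployment can also be created through the ARM REST API with `sku.name` set to `ProvisionedManaged`, which is the point of the SKU change above. The Python sketch below is illustrative only; the subscription, resource group, account name, bearer token, and `api-version` are placeholders, not values from this commit.

```python
# Minimal sketch: create a Provisioned-Managed deployment via the ARM REST API.
# All identifiers below are placeholders; confirm the current api-version in the
# Azure OpenAI quota how-to before using.
import requests

subscription_id = "<subscription-id>"      # placeholder
resource_group = "<resource-group>"        # placeholder
account_name = "<aoai-resource-name>"      # placeholder
deployment_name = "gpt-4"
api_version = "<management-api-version>"   # placeholder
token = "<azure-ad-bearer-token>"          # placeholder, e.g. from `az account get-access-token`

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.CognitiveServices"
    f"/accounts/{account_name}/deployments/{deployment_name}"
)

body = {
    # sku.name must be "ProvisionedManaged" (not "Standard") for provisioned deployments.
    "sku": {"name": "ProvisionedManaged", "capacity": 100},
    "properties": {
        "model": {"format": "OpenAI", "name": "gpt-4", "version": "0613"}
    },
}

resp = requests.put(
    url,
    params={"api-version": api_version},
    headers={"Authorization": f"Bearer {token}"},
    json=body,
)
resp.raise_for_status()
print(resp.json())
```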

articles/ai-services/openai/how-to/latency.md

Lines changed: 9 additions & 9 deletions
@@ -66,9 +66,9 @@ o Generate fewer responses: The best_of & n parameters can greatly increase late
In summary, reducing the number of tokens generated per request reduces the latency of each request.

### Streaming
- Setting `stream: true` in a request makes the service return tokens as soon as they're available, instead of waiting for the full sequence of tokens to be generated. It doesn't change the time to get all the tokens, but it reduces the time for first response. This aproach provides a better user experience since end-suers can read the response as it is generated.
+ Setting `stream: true` in a request makes the service return tokens as soon as they're available, instead of waiting for the full sequence of tokens to be generated. It doesn't change the time to get all the tokens, but it reduces the time for first response. This approach provides a better user experience since end-users can read the response as it is generated.

- Streaming is also valuable with large calls that take a long time to process. Many clients and intermediary layers have timeouts on individual calls. Long generation calls might be cancelled due to client-side time outs. By streaming the data back, you can ensure incremental data is received.
+ Streaming is also valuable with large calls that take a long time to process. Many clients and intermediary layers have timeouts on individual calls. Long generation calls might be canceled due to client-side time outs. By streaming the data back, you can ensure incremental data is received.
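For reference, a minimal sketch of the streaming behavior described above, using the OpenAI Python library (v1.x); the endpoint, key, API version, and deployment name are placeholder assumptions.

```python
# Minimal streaming sketch (OpenAI Python library >= 1.0); all values are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key="<api-key>",                                         # placeholder
    api_version="<api-version>",                                 # placeholder
)

# stream=True returns tokens as they become available instead of one final payload.
stream = client.chat.completions.create(
    model="gpt-4",  # your deployment name (placeholder)
    messages=[{"role": "user", "content": "Write a short poem about latency."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta; print it as it arrives.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```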

@@ -82,7 +82,7 @@ Streaming impacts perceived latency. With streaming enabled you receive tokens b

Sentiment analysis, language translation, content generation.

- There are many use cases where you are performing some bulk task where you only care about the finished result, not the real-time response. If streaming is disabled, you won't receive any tokens until the model has finished the entire response.
+ There are many use cases where you're performing some bulk task where you only care about the finished result, not the real-time response. If streaming is disabled, you won't receive any tokens until the model has finished the entire response.

### Content filtering

@@ -96,23 +96,23 @@ Learn more about requesting modifications to the default, [content filtering pol


### Separation of workloads
- Mixing different workloads on the same endpoint can negatively affect latency. This is because (1) they are batched together during inference and short calls can be waiting for longer completions and (2) mixing the calls can reduce your cache hit rate as they are both competing for the same space. When possible, it is recommended to have separate deployments for each workload.
+ Mixing different workloads on the same endpoint can negatively affect latency. This is because (1) they're batched together during inference and short calls can be waiting for longer completions and (2) mixing the calls can reduce your cache hit rate as they're both competing for the same space. When possible, it's recommended to have separate deployments for each workload.

### Prompt Size
- While prompt size has smaller affect on latency than the generation size it will affect the overall time, especially when the size grows large.
+ While prompt size has smaller influence on latency than the generation size it affects the overall time, especially when the size grows large.

### Batching
- If you are sending multiple requests to the same endpoint, you can batch the requests into a single call. This will reduce the number of requests you need to make and depending on the scenario it might improve overall response time. We recommend testing this method to see if it helps.
+ If you're sending multiple requests to the same endpoint, you can batch the requests into a single call. This reduces the number of requests you need to make and depending on the scenario it might improve overall response time. We recommend testing this method to see if it helps.

## How to measure your throughput
We recommend measuring your overall throughput on a deployment with two measures:
- - Calls per minute: The number of API inference calls you are making per minute. This can be measured in Azure-monitor using the Azure OpenAI Requests metric and splitting by the ModelDeploymentName
+ - Calls per minute: The number of API inference calls you're making per minute. This can be measured in Azure-monitor using the Azure OpenAI Requests metric and splitting by the ModelDeploymentName
- Total Tokens per minute: The total number of tokens being processed per minute by your deployment. This includes prompt & generated tokens. This is often further split into measuring both for a deeper understanding of deployment performance. This can be measured in Azure-Monitor using the Processed Inference tokens metric.

You can learn more about [Monitoring the Azure OpenAI Service](./monitoring.md).

## How to measure per-call latency
- The time it takes for each call depends on how long it takes to read the model, generate the output, and apply content filters. The way you measure the time will vary if you are using streaming or not. We suggest a different set of measures for each case.
+ The time it takes for each call depends on how long it takes to read the model, generate the output, and apply content filters. The way you measure the time will vary if you're using streaming or not. We suggest a different set of measures for each case.

You can learn more about [Monitoring the Azure OpenAI Service](./monitoring.md).
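As a concrete illustration of the Batching section above: the legacy Completions API accepts a list of prompts in one call, so several short tasks can share a single request. A minimal sketch with placeholder values; test whether batching actually helps your workload.

```python
# Minimal batching sketch (OpenAI Python library >= 1.0); all values are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key="<api-key>",                                         # placeholder
    api_version="<api-version>",                                 # placeholder
)

# The Completions endpoint accepts a list of prompts, so several short tasks
# can share one HTTP request instead of one request each.
batch = client.completions.create(
    model="gpt-35-turbo-instruct",  # your completions deployment name (placeholder)
    prompt=[
        "Summarize: the meeting moved to Tuesday.",
        "Summarize: the invoice was paid late.",
    ],
    max_tokens=50,
)

for choice in batch.choices:
    # choice.index maps each completion back to its prompt position.
    print(choice.index, choice.text.strip())
```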

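And for the per-call latency guidance above, a simple client-side sketch that records time to first token and generation time for a streaming call (placeholder values; this complements, rather than replaces, the Azure Monitor metrics).

```python
# Minimal sketch: client-side time-to-first-token and generation time for a streaming call.
import time

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key="<api-key>",                                         # placeholder
    api_version="<api-version>",                                 # placeholder
)

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="gpt-4",  # your deployment name (placeholder)
    messages=[{"role": "user", "content": "List three uses of Azure Monitor."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first content arrives here
        chunks += 1
end = time.perf_counter()

if first_token_at is not None:
    print(f"Time to first token: {first_token_at - start:.2f}s")
    print(f"Generation after first token: {end - first_token_at:.2f}s across {chunks} chunks")
```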
@@ -128,7 +128,7 @@ Time from the first token to the last token, divided by the number of generated

## Summary

- * **Model latency**: If model latency is important to you we recommend trying out our latest models in the [GPT-3.5 Turbo model series](../concepts/models.md).
+ * **Model latency**: If model latency is important to you, we recommend trying out our latest models in the [GPT-3.5 Turbo model series](../concepts/models.md).

* **Lower max tokens**: OpenAI has found that even in cases where the total number of tokens generated is similar the request with the higher value set for the max token parameter will have more latency.

articles/ai-services/openai/how-to/provisioned-get-started.md

Lines changed: 5 additions & 5 deletions
@@ -46,7 +46,7 @@ After you purchase a commitment on your quota, you can create a deployment. To c
| Select a model| Choose the specific model you wish to deploy. | GPT-4 |
| Model version | Choose the version of the model to deploy. | 0613 |
| Deployment Name | The deployment name is used in your code to call the model by using the client libraries and the REST APIs. | gpt-4|
- | Content filter | Specify the filtering policy to apply to the deployment. Learn more on our [Content Filtering](../concepts/content-filter.md) how-tow | Default |
+ | Content filter | Specify the filtering policy to apply to the deployment. Learn more on our [Content Filtering](../concepts/content-filter.md) how-to. | Default |
| Deployment Type |This impacts the throughput and performance. Choose Provisioned-Managed for your provisioned deployment | Provisioned-Managed |
| Provisioned Throughput Units | Choose the amount of throughput you wish to include in the deployment. | 100 |

@@ -62,13 +62,13 @@ az cognitiveservices account deployment create \
--model-version 0613 \
--model-format OpenAI \
--sku-capacity 100 \
- --sku-name Provisioned-Managed
+ --sku-name ProvisionedManaged
```

- REST, ARM template, Bicep and Terraform can also be used to create deployments. See the section on automating deployments in the [Managing Quota](https://learn.microsoft.com/azure/ai-services/openai/how-to/quota?tabs=rest#automate-deployment) how-to guide and replace the `sku.name` with "Provisioned-Managed" rather than "Standard."
+ REST, ARM template, Bicep and Terraform can also be used to create deployments. See the section on automating deployments in the [Managing Quota](https://learn.microsoft.com/azure/ai-services/openai/how-to/quota?tabs=rest#automate-deployment) how-to guide and replace the `sku.name` with "ProvisionedManaged" rather than "Standard."

## Make your first calls
- The inferencing code for provisioned deployments is the same a standard deployment type. The following code snippet shows a chat completions call to a GPT-4 model. For your first time using these models programmatically, we recommend starting with our [quickstart start guide](../quickstart.md). Our recommendation is to use the OpenAI library with version 1.0 or greater since this includes retry logic within the library.
+ The inferencing code for provisioned deployments is the same a standard deployment type. The following code snippet shows a chat completions call to a GPT-4 model. For your first time using these models programmatically, we recommend starting with our [quickstart guide](../quickstart.md). Our recommendation is to use the OpenAI library with version 1.0 or greater since this includes retry logic within the library.


```python
@@ -127,7 +127,7 @@ A 429 response indicates that the allocated PTUs are fully consumed at the time
The 429 signal isn't an unexpected error response when pushing to high utilization but instead part of the design for managing queuing and high load for provisioned deployments.

### Modifying retry logic within the client libraries
- The Azure OpenAI SDKs retry 429 responses by default and behind the scenes in the client (up to the maximum retries). The libraries respect the `retry-after` time. You can also modify the retry behavior to better suite your experience. Here's an example with the python library.
+ The Azure OpenAI SDKs retry 429 responses by default and behind the scenes in the client (up to the maximum retries). The libraries respect the `retry-after` time. You can also modify the retry behavior to better suit your experience. Here's an example with the python library.


You can use the `max_retries` option to configure or disable retry settings:
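The hunk ends before the document's own snippet, so for reference here's a minimal sketch of the `max_retries` option in the OpenAI Python library (v1.x); the endpoint, key, API version, and deployment name are placeholders, not values from this commit.

```python
# Minimal sketch: configuring client-side retries (OpenAI Python library >= 1.0).
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key="<api-key>",                                         # placeholder
    api_version="<api-version>",                                 # placeholder
    max_retries=5,  # retry 429s and other retryable errors up to 5 times; 0 disables retries
)

# A per-request override is also possible via with_options.
response = client.with_options(max_retries=0).chat.completions.create(
    model="gpt-4",  # your deployment name (placeholder)
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```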
