articles/ai-services/openai/concepts/provisioned-throughput.md (7 additions, 7 deletions)
@@ -25,24 +25,24 @@ The provisioned throughput capability allows you to specify the amount of throug
An Azure OpenAI Deployment is a unit of management for a specific OpenAI Model. A deployment provides customer access to a model for inference and integrates more features like Content Moderation ([See content moderation documentation](content-filter.md)).
> [!NOTE]
- > Provisioned throughput unit(PTU) quota is different from standard quota in Azure OpenAI and are not available by default. To learn more about this offering contact your Microsoft Account Team.
+ > Provisioned throughput unit (PTU) quota is different from standard quota in Azure OpenAI and is not available by default. To learn more about this offering, contact your Microsoft Account Team.
## What do you get?
| Topic | Provisioned|
|---|---|
- | What is it? | Provides guaranteed throughput at smaller increments than the existing provisioned offer. Deployments have a consistent max latency for a given model-version |
+ | What is it? | Provides guaranteed throughput at smaller increments than the existing provisioned offer. Deployments have a consistent max latency for a given model-version.|
| Who is it for? | Customers who want guaranteed throughput with minimal latency variance. |
- | Quota | Provisioned-managed throughput Units for a given model |
+ | Quota | Provisioned-managed throughput Units for a given model.|
| Latency | Max latency constrained from the model. Overall latency is a factor of call shape. |
- | Estimating size | Provided calculator in the studio & benchmarking script |
+ | Utilization | Provisioned-managed Utilization measure provided in Azure Monitor.|
+ | Estimating size | Provided calculator in the studio & benchmarking script.|
## Key concepts
### Provisioned throughput units
- Provisioned throughput Units (PTU) are units of model processing capacity that customers you can reserve and deploy for processing prompts and generating completions. The minimum PTU deployment, increments, and processing capacity associated with each unit varies by model type & version.
+ Provisioned throughput units (PTU) are units of model processing capacity that you can reserve and deploy for processing prompts and generating completions. The minimum PTU deployment, increments, and processing capacity associated with each unit vary by model type & version.
### Deployment types
@@ -57,7 +57,7 @@ az cognitiveservices account deployment create \
articles/ai-services/openai/how-to/latency.md (9 additions, 9 deletions)
@@ -66,9 +66,9 @@ o Generate fewer responses: The best_of & n parameters can greatly increase late
In summary, reducing the number of tokens generated per request reduces the latency of each request.
### Streaming
- Setting `stream: true` in a request makes the service return tokens as soon as they're available, instead of waiting for the full sequence of tokens to be generated. It doesn't change the time to get all the tokens, but it reduces the time for first response. This aproach provides a better user experience since end-suers can read the response as it is generated.
+ Setting `stream: true` in a request makes the service return tokens as soon as they're available, instead of waiting for the full sequence of tokens to be generated. It doesn't change the time to get all the tokens, but it reduces the time for first response. This approach provides a better user experience since end-users can read the response as it is generated.
- Streaming is also valuable with large calls that take a long time to process. Many clients and intermediary layers have timeouts on individual calls. Long generation calls might be cancelled due to client-side time outs. By streaming the data back, you can ensure incremental data is received.
+ Streaming is also valuable with large calls that take a long time to process. Many clients and intermediary layers have timeouts on individual calls. Long generation calls might be canceled due to client-side timeouts. By streaming the data back, you can ensure incremental data is received.
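As a quick illustration (not part of the article being changed), a minimal streaming sketch with the OpenAI Python library might look like the following; the library version (1.x), endpoint, key, API version, and the deployment name `gpt-4` are all assumptions or placeholders.

```python
from openai import AzureOpenAI

# Placeholders: replace with your own resource endpoint, key, and API version.
client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
)

# stream=True returns chunks as tokens become available instead of one final response.
response = client.chat.completions.create(
    model="gpt-4",  # assumed deployment name
    messages=[{"role": "user", "content": "Write a short poem about latency."}],
    stream=True,
)

for chunk in response:
    # Each chunk may carry a small piece of the generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```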
@@ -82,7 +82,7 @@ Streaming impacts perceived latency. With streaming enabled you receive tokens b
Sentiment analysis, language translation, content generation.
- There are many use cases where you are performing some bulk task where you only care about the finished result, not the real-time response. If streaming is disabled, you won't receive any tokens until the model has finished the entire response.
+ There are many use cases where you're performing some bulk task where you only care about the finished result, not the real-time response. If streaming is disabled, you won't receive any tokens until the model has finished the entire response.
### Content filtering
@@ -96,23 +96,23 @@ Learn more about requesting modifications to the default, [content filtering pol
### Separation of workloads
- Mixing different workloads on the same endpoint can negatively affect latency. This is because (1) they are batched together during inference and short calls can be waiting for longer completions and (2) mixing the calls can reduce your cache hit rate as they are both competing for the same space. When possible, it is recommended to have separate deployments for each workload.
+ Mixing different workloads on the same endpoint can negatively affect latency. This is because (1) they're batched together during inference and short calls can be waiting for longer completions and (2) mixing the calls can reduce your cache hit rate as they're both competing for the same space. When possible, it's recommended to have separate deployments for each workload.
### Prompt Size
- While prompt size has smaller affect on latency than the generation size it will affect the overall time, especially when the size grows large.
+ While prompt size has a smaller influence on latency than the generation size, it affects the overall time, especially when the size grows large.
### Batching
- If you are sending multiple requests to the same endpoint, you can batch the requests into a single call. This will reduce the number of requests you need to make and depending on the scenario it might improve overall response time. We recommend testing this method to see if it helps.
+ If you're sending multiple requests to the same endpoint, you can batch the requests into a single call. This reduces the number of requests you need to make and depending on the scenario it might improve overall response time. We recommend testing this method to see if it helps.
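For illustration only: the legacy Completions endpoint accepts a list of prompts, so several small requests can travel in one call. The sketch below assumes the OpenAI Python library 1.x and a completions-capable deployment named `gpt-35-turbo-instruct`; names and versions are placeholders.

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
)

# One API call carrying several prompts instead of several calls with one prompt each.
prompts = [
    "Summarize: The meeting moved to Tuesday.",
    "Summarize: Shipping is delayed by two days.",
    "Summarize: The invoice was paid in full.",
]

result = client.completions.create(
    model="gpt-35-turbo-instruct",  # assumed completions-capable deployment name
    prompt=prompts,
    max_tokens=50,
)

# Each returned choice carries an index that maps it back to its prompt.
for choice in sorted(result.choices, key=lambda c: c.index):
    print(choice.index, choice.text.strip())
```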
## How to measure your throughput
We recommend measuring your overall throughput on a deployment with two measures:
- - Calls per minute: The number of API inference calls you are making per minute. This can be measured in Azure-monitor using the Azure OpenAI Requests metric and splitting by the ModelDeploymentName
+ - Calls per minute: The number of API inference calls you're making per minute. This can be measured in Azure-monitor using the Azure OpenAI Requests metric and splitting by the ModelDeploymentName
- Total Tokens per minute: The total number of tokens being processed per minute by your deployment. This includes prompt & generated tokens. This is often further split into measuring both for a deeper understanding of deployment performance. This can be measured in Azure-Monitor using the Processed Inference tokens metric.
You can learn more about [Monitoring the Azure OpenAI Service](./monitoring.md).
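Azure Monitor is the authoritative source for both measures. As a rough client-side cross-check (an illustration, not from the article), you can tally calls and the `usage.total_tokens` field returned by non-streaming responses; the client settings and deployment name below are placeholders.

```python
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
)

calls = 0
tokens = 0
start = time.monotonic()

for question in ["What is 2+2?", "Name a prime number.", "Spell 'latency'."]:
    response = client.chat.completions.create(
        model="gpt-4",  # assumed deployment name
        messages=[{"role": "user", "content": question}],
    )
    calls += 1
    tokens += response.usage.total_tokens  # prompt + generated tokens

elapsed_minutes = (time.monotonic() - start) / 60
print(f"Calls per minute: {calls / elapsed_minutes:.1f}")
print(f"Total tokens per minute: {tokens / elapsed_minutes:.1f}")
```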
## How to measure per-call latency
- The time it takes for each call depends on how long it takes to read the model, generate the output, and apply content filters. The way you measure the time will vary if you are using streaming or not. We suggest a different set of measures for each case.
+ The time it takes for each call depends on how long it takes to read the model, generate the output, and apply content filters. The way you measure the time will vary depending on whether you're using streaming. We suggest a different set of measures for each case.
You can learn more about [Monitoring the Azure OpenAI Service](./monitoring.md).
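One common client-side approach for streaming calls is to record the time to the first token and the time from first to last token divided by the number of generated tokens. A sketch, assuming the OpenAI Python library 1.x with placeholder endpoint and deployment names:

```python
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
)

start = time.monotonic()
first_token_time = None
generated_chunks = 0

stream = client.chat.completions.create(
    model="gpt-4",  # assumed deployment name
    messages=[{"role": "user", "content": "Explain streaming in one paragraph."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.monotonic()
        generated_chunks += 1  # rough proxy: one content chunk is roughly one token

end = time.monotonic()
print(f"Time to first token: {first_token_time - start:.2f} s")
if generated_chunks > 1:
    print(f"Seconds per generated token (approx.): {(end - first_token_time) / generated_chunks:.3f}")
```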
@@ -128,7 +128,7 @@ Time from the first token to the last token, divided by the number of generated
## Summary
- * **Model latency**: If model latency is important to you we recommend trying out our latest models in the [GPT-3.5 Turbo model series](../concepts/models.md).
+ * **Model latency**: If model latency is important to you, we recommend trying out our latest models in the [GPT-3.5 Turbo model series](../concepts/models.md).
* **Lower max tokens**: OpenAI has found that even in cases where the total number of tokens generated is similar the request with the higher value set for the max token parameter will have more latency.
articles/ai-services/openai/how-to/provisioned-get-started.md (5 additions, 5 deletions)
@@ -46,7 +46,7 @@ After you purchase a commitment on your quota, you can create a deployment. To c
| Select a model| Choose the specific model you wish to deploy. | GPT-4 |
| Model version | Choose the version of the model to deploy. | 0613 |
| Deployment Name | The deployment name is used in your code to call the model by using the client libraries and the REST APIs. | gpt-4|
- | Content filter | Specify the filtering policy to apply to the deployment. Learn more on our [Content Filtering](../concepts/content-filter.md) how-tow| Default |
+ | Content filter | Specify the filtering policy to apply to the deployment. Learn more on our [Content Filtering](../concepts/content-filter.md) how-to.| Default |
| Deployment Type |This impacts the throughput and performance. Choose Provisioned-Managed for your provisioned deployment | Provisioned-Managed |
| Provisioned Throughput Units | Choose the amount of throughput you wish to include in the deployment. | 100 |
@@ -62,13 +62,13 @@ az cognitiveservices account deployment create \
--model-version 0613 \
--model-format OpenAI \
--sku-capacity 100 \
- --sku-name Provisioned-Managed
+ --sku-name ProvisionedManaged
```
- REST, ARM template, Bicep and Terraform can also be used to create deployments. See the section on automating deployments in the [Managing Quota](https://learn.microsoft.com/azure/ai-services/openai/how-to/quota?tabs=rest#automate-deployment) how-to guide and replace the `sku.name` with "Provisioned-Managed" rather than "Standard."
+ REST, ARM template, Bicep and Terraform can also be used to create deployments. See the section on automating deployments in the [Managing Quota](https://learn.microsoft.com/azure/ai-services/openai/how-to/quota?tabs=rest#automate-deployment) how-to guide and replace the `sku.name` with "ProvisionedManaged" rather than "Standard."
## Make your first calls
- The inferencing code for provisioned deployments is the same a standard deployment type. The following code snippet shows a chat completions call to a GPT-4 model. For your first time using these models programmatically, we recommend starting with our [quickstart start guide](../quickstart.md). Our recommendation is to use the OpenAI library with version 1.0 or greater since this includes retry logic within the library.
+ The inferencing code for provisioned deployments is the same as for a standard deployment type. The following code snippet shows a chat completions call to a GPT-4 model. For your first time using these models programmatically, we recommend starting with our [quickstart guide](../quickstart.md). Our recommendation is to use the OpenAI library with version 1.0 or greater since this includes retry logic within the library.
```python
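# Illustrative sketch only; the article's original snippet is collapsed in this diff view.
# Assumptions: OpenAI Python library >= 1.0; the endpoint, key, API version, and the
# deployment name "gpt-4" are placeholders to replace with your own values.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
)

response = client.chat.completions.create(
    model="gpt-4",  # the deployment name you chose when deploying the model
    messages=[{"role": "user", "content": "Tell me something about provisioned throughput."}],
)

print(response.choices[0].message.content)
```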
@@ -127,7 +127,7 @@ A 429 response indicates that the allocated PTUs are fully consumed at the time
The 429 signal isn't an unexpected error response when pushing to high utilization but instead part of the design for managing queuing and high load for provisioned deployments.
### Modifying retry logic within the client libraries
- The Azure OpenAI SDKs retry 429 responses by default and behind the scenes in the client (up to the maximum retries). The libraries respect the `retry-after` time. You can also modify the retry behavior to better suite your experience. Here's an example with the python library.
+ The Azure OpenAI SDKs retry 429 responses by default and behind the scenes in the client (up to the maximum retries). The libraries respect the `retry-after` time. You can also modify the retry behavior to better suit your experience. Here's an example with the Python library.
You can use the `max_retries` option to configure or disable retry settings:
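A minimal sketch of both options with the OpenAI Python library (1.x assumed; endpoint, key, API version, and deployment name are placeholders):

```python
from openai import AzureOpenAI

# max_retries on the client applies to every request made through it;
# setting it to 0 disables the SDK's automatic retries.
client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
    max_retries=5,
)

# with_options overrides the retry count for a single call.
response = client.with_options(max_retries=0).chat.completions.create(
    model="gpt-4",  # assumed deployment name
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```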
0 commit comments