articles/ai-services/openai/concepts/provisioned-throughput.md
For Provisioned-Managed and Global Provisioned-Managed, we use a variation of the leaky bucket algorithm to maintain utilization below 100% while allowing some burstiness in the traffic. The high-level logic is as follows:
1. Each customer has a set amount of capacity they can utilize on a deployment.
1. When a request is made:

    a. When the current utilization is above 100%, the service returns a 429 code with the `retry-after-ms` header set to the time until utilization is below 100%. (A client-side retry sketch that honors this header follows the note at the end of this section.)

    b. Otherwise, the service estimates the incremental change to utilization required to serve the request by combining the prompt tokens and the specified `max_tokens` in the call. For requests that include at least 1024 cached tokens, the cached tokens are subtracted from the prompt token value. A customer can receive up to a 100% discount on their prompt tokens depending on the size of their cached tokens. If the `max_tokens` parameter is not specified, the service estimates a value. This estimation can lead to lower concurrency than expected when the number of actual generated tokens is small. For highest concurrency, ensure that the `max_tokens` value is as close as possible to the true generation size.

1. When a request finishes, we now know the actual compute cost for the call. To ensure accurate accounting, we correct the utilization using the following logic:

    a. If the actual > estimated, then the difference is added to the deployment's utilization.

    b. If the actual < estimated, then the difference is subtracted.

1. The overall utilization is decremented at a continuous rate based on the number of PTUs deployed. (The sketch after this list models steps 2b through 4.)
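The following Python sketch models this estimate-and-correct loop. It is a simplified illustration, not the service's actual implementation: the `UtilizationBucket` class, the token-to-utilization conversion rate (`TOKENS_PER_PTU_PER_MINUTE`), and the drain rate are all made-up assumptions chosen to make the mechanics concrete.

```python
import time


class UtilizationBucket:
    """Simplified model of the utilization accounting described above.

    The conversion rate from tokens to capacity and the drain rate are
    illustrative assumptions; the real service's rates differ.
    """

    TOKENS_PER_PTU_PER_MINUTE = 50_000  # assumed conversion rate

    def __init__(self, deployed_ptus: int):
        self.deployed_ptus = deployed_ptus
        self.utilization = 0.0            # 0.0 .. 1.0+ (1.0 == 100%)
        self._last_decay = time.monotonic()

    def _capacity_per_minute(self) -> float:
        # In token terms, capacity scales with the number of deployed PTUs.
        return self.deployed_ptus * self.TOKENS_PER_PTU_PER_MINUTE

    def _decay(self) -> None:
        # Step 4: utilization drains continuously. In this model the bucket
        # drains one full capacity per minute, i.e. deployed_ptus *
        # TOKENS_PER_PTU_PER_MINUTE tokens per minute in absolute terms.
        now = time.monotonic()
        elapsed_minutes = (now - self._last_decay) / 60.0
        self._last_decay = now
        self.utilization = max(0.0, self.utilization - elapsed_minutes)

    def estimate_cost(self, prompt_tokens: int, cached_tokens: int,
                      max_tokens: int) -> float:
        # Step 2b: when at least 1024 tokens are cached, the cached tokens
        # are subtracted from the prompt tokens; max_tokens stands in for
        # the unknown generation size.
        billable_prompt = prompt_tokens
        if cached_tokens >= 1024:
            billable_prompt = max(0, prompt_tokens - cached_tokens)
        return (billable_prompt + max_tokens) / self._capacity_per_minute()

    def admit(self, prompt_tokens: int, cached_tokens: int,
              max_tokens: int) -> float | None:
        """Return the estimated cost if admitted, or None for a 429."""
        self._decay()
        if self.utilization > 1.0:
            return None  # step 2a: reject with 429 + retry-after-ms
        estimated = self.estimate_cost(prompt_tokens, cached_tokens, max_tokens)
        self.utilization += estimated
        return estimated

    def settle(self, estimated: float, actual: float) -> None:
        # Step 3: once the true cost is known, correct the utilization.
        # 3a: actual > estimated -> the difference is added.
        # 3b: actual < estimated -> the difference is subtracted.
        self.utilization = max(0.0, self.utilization + (actual - estimated))


# Example: 2,000 prompt tokens (1,500 of them cached), capped at 500 output tokens.
bucket = UtilizationBucket(deployed_ptus=100)
est = bucket.admit(prompt_tokens=2000, cached_tokens=1500, max_tokens=500)
if est is None:
    print("429: retry later")
else:
    # Suppose the call actually generated only 120 tokens; recompute the
    # true cost with the same formula and reconcile.
    actual = bucket.estimate_cost(2000, 1500, 120)
    bucket.settle(estimated=est, actual=actual)
```

Note one design point this model makes visible: a request is admitted as long as utilization is not already above 100%, even if its estimated cost pushes the bucket past 100%. That is what allows the short bursts described in the note below.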
> [!NOTE]
> Calls are accepted until utilization reaches 100%. Bursts just over 100% may be permitted in short periods, but over time, your traffic is capped at 100% utilization.
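Because `retry-after-ms` reports the time until utilization drops back below 100%, a client can sleep for exactly that interval rather than guessing a backoff. Here is a minimal sketch using the `requests` library; the URL, headers, payload, and the one-second fallback when the header is missing are illustrative assumptions:

```python
import time

import requests


def call_with_retry(url: str, headers: dict, payload: dict,
                    max_attempts: int = 5) -> requests.Response:
    """POST to a provisioned deployment, honoring retry-after-ms on 429."""
    for attempt in range(max_attempts):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code != 429:
            return response
        # The service reports how long until utilization is below 100%;
        # fall back to one second if the header is absent (assumption).
        wait_ms = int(response.headers.get("retry-after-ms", 1000))
        time.sleep(wait_ms / 1000.0)
    return response
```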