Skip to content

Commit 92e8663

Browse files
Merge pull request #265574 from mrbullwinkle/mrb_02_07_2024_latency
[Azure OpenAI] Markdown syntax and typo fix
2 parents 54a21ae + bdc8a1c commit 92e8663

File tree

1 file changed

+6
-6
lines changed

1 file changed

+6
-6
lines changed

articles/ai-services/openai/how-to/latency.md

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ description: Learn about performance and latency with Azure OpenAI
55
manager: nitinme
66
ms.service: azure-ai-openai
77
ms.topic: how-to
8-
ms.date: 11/21/2023
8+
ms.date: 02/07/2024
99
author: mrbullwinkle
1010
ms.author: mbullwin
1111
recommendations: false
@@ -58,17 +58,17 @@ Latency varies based on what model you're using. For an identical request, expec
5858

5959
When you send a completion request to the Azure OpenAI endpoint, your input text is converted to tokens that are then sent to your deployed model. The model receives the input tokens and then begins generating a response. It's an iterative sequential process, one token at a time. Another way to think of it is like a for loop with `n tokens = n iterations`. For most models, generating the response is the slowest step in the process.
6060

61-
At the time of the request, the requested generation size (max_tokens parameter) is used as an initial estimate of the generation size. The compute-time for generating the full size is reserved the model as the request is processed. Once the generation is completed, the remaining quota is released. Ways to reduce the number of tokens:
62-
o Set the `max_token` parameter on each call as small as possible.
63-
o Include stop sequences to prevent generating extra content.
64-
o Generate fewer responses: The best_of & n parameters can greatly increase latency because they generate multiple outputs. For the fastest response, either don't specify these values or set them to 1.
61+
At the time of the request, the requested generation size (max_tokens parameter) is used as an initial estimate of the generation size. The compute-time for generating the full size is reserved by the model as the request is processed. Once the generation is completed, the remaining quota is released. Ways to reduce the number of tokens:
62+
- Set the `max_token` parameter on each call as small as possible.
63+
- Include stop sequences to prevent generating extra content.
64+
- Generate fewer responses: The best_of & n parameters can greatly increase latency because they generate multiple outputs. For the fastest response, either don't specify these values or set them to 1.
6565

6666
In summary, reducing the number of tokens generated per request reduces the latency of each request.
6767

6868
### Streaming
6969
Setting `stream: true` in a request makes the service return tokens as soon as they're available, instead of waiting for the full sequence of tokens to be generated. It doesn't change the time to get all the tokens, but it reduces the time for first response. This approach provides a better user experience since end-users can read the response as it is generated.
7070

71-
Streaming is also valuable with large calls that take a long time to process. Many clients and intermediary layers have timeouts on individual calls. Long generation calls might be canceled due to client-side time outs. By streaming the data back, you can ensure incremental data is received.
71+
Streaming is also valuable with large calls that take a long time to process. Many clients and intermediary layers have timeouts on individual calls. Long generation calls might be canceled due to client-side time outs. By streaming the data back, you can ensure incremental data is received.
7272

7373

7474

0 commit comments

Comments
 (0)