Commit bdc8a1c

Commit message: update
1 parent e41ba34 commit bdc8a1c

File tree: 1 file changed (+1 −1 lines changed)

1 file changed

+1
-1
lines changed

articles/ai-services/openai/how-to/latency.md

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@ When you send a completion request to the Azure OpenAI endpoint, your input text

  At the time of the request, the requested generation size (the `max_tokens` parameter) is used as an initial estimate of the generation size. The compute time for generating the full size is reserved by the model as the request is processed. Once the generation is completed, the remaining quota is released. Ways to reduce the number of tokens:

  - Set the `max_tokens` parameter on each call as small as possible.
  - Include stop sequences to prevent generating extra content.
- - Generate fewer responses: The `best_of` and `n` parameters can greatly increase latency because they generate multiple outputs. For the fastest response, either don't specify these values or set them to 1.
+ - Generate fewer responses: The `best_of` and `n` parameters can greatly increase latency because they generate multiple outputs. For the fastest response, either don't specify these values or set them to 1.

  In summary, reducing the number of tokens generated per request reduces the latency of each request.
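A minimal sketch of the settings this section recommends, assuming the OpenAI Python SDK's chat-completions parameter names (`max_tokens`, `stop`, `n`); the deployment name, prompt, and helper function are placeholders, not part of the documented change:

```python
# Sketch: build a low-latency completion request, assuming the OpenAI
# Python SDK parameter names. "my-deployment" and the prompt are placeholders.
def low_latency_request(prompt: str) -> dict:
    return {
        "model": "my-deployment",  # Azure OpenAI deployment name (placeholder)
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 50,  # keep the requested generation size as small as possible
        "stop": ["\n\n"],  # stop sequence prevents generating extra content
        "n": 1,            # single output; leaving best_of unset avoids extra generations
    }

params = low_latency_request("Summarize the request in one sentence.")
# client.chat.completions.create(**params)  # real call needs an AzureOpenAI client
```

Because `max_tokens` reserves compute time for the full requested size, tightening it (and relying on `stop` rather than a large cap) shortens both the reservation and the generation itself.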
