# Predicted outputs
Predicted outputs can improve model response latency for chat completions calls where minimal changes are needed to a larger body of text. If you're asking the model to provide a response where a large portion of the expected response is already known, predicted outputs can significantly reduce the latency of this request. This capability is particularly well-suited for coding scenarios, including autocomplete, error detection, and real-time editing, where speed and responsiveness are critical for developers and end-users. Rather than have the model regenerate all the text from scratch, you can indicate to the model that most of the response is already known by passing the known text to the `prediction` parameter.
## Model support
## Unsupported features
Predicted outputs is currently text-only. The following features can't be used in conjunction with the `prediction` parameter and predicted outputs:
- Tools/Function calling
- Audio models/inputs and outputs
- `n` values higher than `1`
- `logprobs`
- `presence_penalty` values greater than `0`
- `frequency_penalty` values greater than `0`
- `max_completion_tokens`
> [!NOTE]
> The predicted outputs feature is currently unavailable for models in the Southeast Asia region.
## Getting started
To demonstrate the basics of predicted outputs, we'll start by asking a model to refactor the code from the common programming `FizzBuzz` problem to replace the instance of `FizzBuzz` with `MSFTBuzz`. We'll pass our example code to the model in two places: first as part of a user message in the `messages` array/list, and a second time as part of the content of the new `prediction` parameter.
You might need to upgrade your OpenAI client library to access the `prediction` parameter.
```cmd
pip install openai --upgrade
```
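
The request itself is easiest to see in code. The following is a minimal sketch rather than a definitive sample: it assumes the `AzureOpenAI` client from the `openai` Python package, placeholder environment variables for your endpoint and key, an illustrative API version, and a hypothetical `gpt-4o` deployment name.

```python
import os

from openai import AzureOpenAI  # requires a recent version of the openai package

# Placeholder endpoint, key, API version, and deployment name -- replace with your own values.
client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2025-01-01-preview",
)

code = """
for number in range(1, 101):
    if number % 3 == 0 and number % 5 == 0:
        print("FizzBuzz")
    elif number % 3 == 0:
        print("Fizz")
    elif number % 5 == 0:
        print("Buzz")
    else:
        print(number)
"""

completion = client.chat.completions.create(
    model="gpt-4o",  # replace with the name of your model deployment
    messages=[
        {
            "role": "user",
            "content": "Replace FizzBuzz with MSFTBuzz. Respond only with code: " + code,
        }
    ],
    # The expected response text is passed a second time via the prediction parameter.
    prediction={"type": "content", "content": code},
)

print(completion.choices[0].message.content)
print(completion.usage)  # includes accepted_prediction_tokens and rejected_prediction_tokens
```

The same code string does double duty here: once as part of the instruction in `messages`, and once as the prediction content that the model can accept or reject token by token.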
Notice in the output the new response parameters for `accepted_prediction_tokens` and `rejected_prediction_tokens`.
The `accepted_prediction_tokens` help reduce model response latency, but any `rejected_prediction_tokens` have the same cost implication as additional output tokens generated by the model. For this reason, while predicted outputs can improve model response times, it can result in greater costs. You'll need to evaluate and balance the increased model performance against the potential increases in cost.
It's also important to understand that using predicted outputs doesn't guarantee a reduction in latency. A large request with a greater percentage of rejected prediction tokens than accepted prediction tokens could result in an increase in model response latency, rather than a decrease.
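
One way to evaluate that trade-off is to log how much of your prediction was accepted versus rejected per request. The following is a small sketch, assuming the `completion` object from the earlier example and that the usage object exposes these counts under `completion_tokens_details` (as in recent versions of the `openai` package):

```python
# Sketch: compare accepted vs. rejected prediction tokens for a single request.
# Assumes `completion` comes from the earlier example and that a recent openai package
# version exposes these counts under usage.completion_tokens_details.
details = completion.usage.completion_tokens_details
print(f"accepted_prediction_tokens: {details.accepted_prediction_tokens}")
print(f"rejected_prediction_tokens: {details.rejected_prediction_tokens}")
```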
> [!NOTE]
> Unlike [prompt caching](./prompt-caching.md), which only works when a set minimum number of initial tokens at the beginning of a request are identical, predicted outputs isn't constrained by token location. Even if your response text contains new output that will be returned prior to the predicted output, `accepted_prediction_tokens` can still occur.
## Streaming
The predicted outputs performance boost is often most obvious if you're returning your responses with streaming enabled.
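
As a rough sketch, reusing the hypothetical `client` and `code` values from the earlier example, the same request with streaming enabled might look like this:

```python
# Same request as before, but streamed; tokens print as they arrive.
stream = client.chat.completions.create(
    model="gpt-4o",  # replace with the name of your model deployment
    messages=[
        {
            "role": "user",
            "content": "Replace FizzBuzz with MSFTBuzz. Respond only with code: " + code,
        }
    ],
    prediction={"type": "content", "content": code},
    stream=True,
)

for chunk in stream:
    # Guard against empty choices and None deltas, which can occur in streamed responses.
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
print()
```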