|
6 | 6 | "source": [ |
7 | 7 | "## Better performance from reasoning models using the Responses API \n", |
8 | 8 | "\n", |
9 | | - "We've recently released two new state-of-the-art reasoning models, o3 and o4-mini, that excel at combining reasoning capabilities with agentic tool use. What a lot of folks don't know is that you can improve their performance by fully leveraging our (relatively) new Responses API. This cookbook aims to demonstrate how you might be able to get the most of the two models and dive a little deeper on the details on how reasoning and function calling works for these models behind the scenes. By giving the model access to previous reasoning items, we can ensure make sure it is operating at maximum model intelligence and lowest cost. \n" |
| 9 | + "We've recently released two new state-of-the-art reasoning models, o3 and o4-mini, that excel at combining reasoning capabilities with agentic tool use. What a lot of folks don't know is that you can improve their performance by fully leveraging our (relatively) new Responses API. This cookbook aims to demonstrate how you might be able to get the most out of the two models and dive a little deeper into the details of how reasoning and function calling work for these models behind the scenes. By giving the model access to previous reasoning items, we can ensure it is operating at maximum model intelligence and lowest cost.\n" |
10 | 10 | ] |
11 | 11 | }, |
12 | 12 | { |
13 | 13 | "cell_type": "markdown", |
14 | 14 | "metadata": {}, |
15 | 15 | "source": [ |
16 | | - "We've introduced the Responses API during its launch with a separate [cookbook](https://cookbook.openai.com/examples/responses_api/responses_example) along with the [API reference](https://platform.openai.com/docs/api-reference/responses). The short takeaway is that by design the Responses API isn't that different from the Completions API with a few improvements and added features. We've recently rolled out encrypted content for Responses, which we will also get into here, which will make it even more useful for folks who cannoot use Responses API in a stateful way!" |
| 16 | + "We introduced the Responses API at its launch with a separate [cookbook](https://cookbook.openai.com/examples/responses_api/responses_example) along with the [API reference](https://platform.openai.com/docs/api-reference/responses). The short takeaway is that, by design, the Responses API isn't that different from the Completions API, with a few improvements and added features. We've also recently rolled out encrypted content for Responses, which we will get into here, and which makes it even more useful for folks who cannot use the Responses API in a stateful way!"
17 | 17 | ] |
18 | 18 | }, |
19 | 19 | { |
|
22 | 22 | "source": [ |
23 | 23 | "## How Reasoning Models work\n", |
24 | 24 | "\n", |
25 | | - "Before we dive into how Responses API can help us, it is useful for us to first review how [reasoning models](https://platform.openai.com/docs/guides/reasoning?api-mode=responses) work behind the scenes. Reasoning models like o3 and o4-mini takes time to think through a problem before answering. Through this thinking process, the model is able to break a complex problem down and work it through step by step, increasing its performance on these tasks. During the thinking process, the models produces a long internal chain of thought that encodes the reasoning logic for the problem. For safety reasons, the reasoning tokens are only exposed to end suers in summarized rather than raw forms. " |
| 25 | + "Before we dive into how the Responses API can help us, it is useful to first review how [reasoning models](https://platform.openai.com/docs/guides/reasoning?api-mode=responses) work behind the scenes. Reasoning models like o3 and o4-mini take time to think through a problem before answering. Through this thinking process, the model is able to break a complex problem down and work through it step by step, increasing its performance on these tasks. During the thinking process, the models produce a long internal chain of thought that encodes the reasoning logic for the problem. For safety reasons, the reasoning tokens are only exposed to end users in summarized form rather than in raw form." |
26 | 26 | ] |
27 | 27 | }, |
28 | 28 | { |
|
151 | 151 | "cell_type": "markdown", |
152 | 152 | "metadata": {}, |
153 | 153 | "source": [ |
154 | | - "You can see that from the json dump of the response object, that in addition to the `output_text`, there is a reasoning item that was also produced from this single API call. This represent the reasoning tokens produced by the model. By defualt, it is exposed as an id, in this instance here it is `rs_6820f383d7c08191846711c5df8233bc0ac5ba57aafcbac7`. Since the Responses API is stateful as well, the reasoning token is persisted - all you have to do is to include these items along with their associated id's in subsequent messages for subsequent response to have access to the same reasoning items. If you use `previous_response_id` for multi-turn conversations, the model will also have access to all the reasoning items produced previously.\n", |
| 154 | + "You can see from the JSON dump of the response object that, in addition to the `output_text`, a reasoning item was also produced by this single API call. This represents the reasoning tokens produced by the model. By default, it is exposed as an ID; in this instance, it is `rs_6820f383d7c08191846711c5df8233bc0ac5ba57aafcbac7`. Since the Responses API is stateful, the reasoning item is persisted; all you have to do is include these items along with their associated IDs in subsequent messages for subsequent responses to have access to the same reasoning items. If you use `previous_response_id` for multi-turn conversations, the model will also have access to all the reasoning items produced previously.\n",
155 | 155 | "\n", |
156 | | - "Note, you can see how much reasoning token the model has produced from this response. With a total # of 10 input tokens, we produced 148 output tokens, of which 128 are reasoning tokens that you don't see from the final assistant message." |
| 156 | + "Note, you can see how many reasoning tokens the model has produced from this response. With a total of 10 input tokens, we produced 148 output tokens, of which 128 are reasoning tokens that you don't see in the final assistant message." |
157 | 157 | ] |
158 | 158 | }, |
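The inspection described above can be sketched in a few lines. This is a minimal sketch, not a live call: the `resp` dict is an illustrative stub shaped like the JSON dump this cell discusses (the IDs, message text, and token counts are hypothetical); with the real SDK you would inspect `response.output` and `response.usage` in the same way.

```python
# Minimal sketch: pull reasoning-item IDs and the reasoning-token count out of
# a response-shaped dict. With the real SDK, `response.output` is a list of
# typed items and `response.usage` carries the token breakdown.

def summarize_reasoning(resp: dict) -> dict:
    """Collect reasoning-item IDs and the reasoning-token count from a response."""
    reasoning_ids = [
        item["id"] for item in resp["output"] if item["type"] == "reasoning"
    ]
    return {
        "reasoning_ids": reasoning_ids,
        "reasoning_tokens": resp["usage"]["output_tokens_details"]["reasoning_tokens"],
    }

# Illustrative stub of a Responses API result (IDs and text are hypothetical).
resp = {
    "output": [
        {"type": "reasoning", "id": "rs_123", "summary": []},
        {"type": "message", "id": "msg_456",
         "content": [{"type": "output_text", "text": "The answer is 42."}]},
    ],
    "usage": {
        "input_tokens": 10,
        "output_tokens": 148,
        "output_tokens_details": {"reasoning_tokens": 128},
    },
}

print(summarize_reasoning(resp))
# → {'reasoning_ids': ['rs_123'], 'reasoning_tokens': 128}
```

Note that the 128 reasoning tokens are counted inside `output_tokens`, which is why the final assistant message looks much shorter than the billed output.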
159 | 159 | { |
160 | 160 | "cell_type": "markdown", |
161 | 161 | "metadata": {}, |
162 | 162 | "source": [ |
163 | | - "But wait! From the above diagram, didn't you say that reasoning from previous turns are discarded? Then why does passing it back in matter for subsequent turns? \n", |
| 163 | + "But wait! From the above diagram, didn’t you say that reasoning from previous turns is discarded? Then why does passing it back in matter for subsequent turns?\n", |
164 | 164 | "\n", |
165 | | - "If you've been paying attention, you probably have that question. That is a great question -- For normal multi-turn conversations, the inclusion of reasoning items and tokens are not necessary - the model is trained so that it does not need the reasoning tokens from previous turns to produce the best output. This changes when we consider the possibility of tool use. When we talk about a single turn, the turn may include function calls as well - despite the fact that it may involve an additional round trip outside of the API. In this instance, it is necessary to include the reasoning items (either via `previous_response_id` or explicitly including the reasoning item in `input`). To illustrate this, let's cook up a quick function calling example." |
| 165 | + "If you’ve been paying attention, you probably have that question. That is a great question. For normal multi-turn conversations, the inclusion of reasoning items and tokens is not necessary—the model is trained so that it does not need the reasoning tokens from previous turns to produce the best output. This changes when we consider the possibility of tool use. When we talk about a single turn, the turn may include function calls as well, even though it may involve an additional round trip outside of the API. In this instance, it is necessary to include the reasoning items (either via `previous_response_id` or by explicitly including the reasoning item in `input`). To illustrate this, let’s create a quick function-calling example." |
166 | 166 | ] |
167 | 167 | }, |
168 | 168 | { |
|
223 | 223 | "cell_type": "markdown", |
224 | 224 | "metadata": {}, |
225 | 225 | "source": [ |
226 | | - "Here we see that after reasoning for a bit, the o4-mini has decided that it needs additional information which it can obtain from calling a function, which we can go ahead and call and pass the output back to the model. The important thing to note here is that in order for the model have the maximum intelligence, we need to pass the reasoning item back, which one call do simply by adding all of the output back into the context being passed back." |
| 226 | + "Here we see that after reasoning for a bit, the o4-mini model has decided that it needs additional information, which it can obtain by calling a function. We can go ahead and call the function and pass the output back to the model. The important thing to note here is that, in order for the model to have maximum intelligence, we need to pass the reasoning item back, which you can do simply by including all of the output items in the context of the next request."
227 | 227 | ] |
228 | 228 | }, |
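The pass-back step can be sketched as follows. This is a minimal sketch under stated assumptions: `get_weather`, the IDs, and the tool result are hypothetical stand-ins, and the stub list mimics the shape of `response.output`; with the real SDK you would extend your input list with `response.output` itself.

```python
# Minimal sketch: extend the next request's input with the model's full output
# (reasoning item + function call) plus the tool result, so the reasoning
# context survives the function-call round trip.

def build_followup_input(input_list, response_output, call_id, tool_result):
    """Return the input for the next request, preserving reasoning items."""
    return input_list + response_output + [
        {"type": "function_call_output", "call_id": call_id, "output": tool_result}
    ]

input_list = [{"role": "user", "content": "What's the weather in Paris?"}]

# Illustrative stub of response.output (IDs and arguments are hypothetical).
response_output = [
    {"type": "reasoning", "id": "rs_abc", "summary": []},
    {"type": "function_call", "id": "fc_def", "call_id": "call_1",
     "name": "get_weather", "arguments": '{"city": "Paris"}'},
]

next_input = build_followup_input(input_list, response_output, "call_1", "18°C, clear")
print([item.get("type", item.get("role")) for item in next_input])
# → ['user', 'reasoning', 'function_call', 'function_call_output']
```

The key design point is that the reasoning item travels back alongside the function call and its output, rather than being dropped between the two halves of the turn.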
229 | 229 | { |
|
269 | 269 | "cell_type": "markdown", |
270 | 270 | "metadata": {}, |
271 | 271 | "source": [ |
272 | | - "It is hard to illustrate the improved model intelligence in this toy example since the model will probably still do the right thing with or without the reasoning item being included so we ran some tests ourselves: In a more comprehensive benchmark like SWE-bench, we were able to get about **3% improvement** by including the reasoning items for the same prompt and setup." |
| 272 | + "It is hard to illustrate the improved model intelligence in this toy example, since the model will probably still do the right thing with or without the reasoning item being included. So we ran some tests ourselves: in a more comprehensive benchmark like SWE-bench, we were able to get about **3% improvement** by including the reasoning items for the same prompt and setup." |
273 | 273 | ] |
274 | 274 | }, |
275 | 275 | { |
276 | 276 | "cell_type": "markdown", |
277 | 277 | "metadata": {}, |
278 | 278 | "source": [ |
279 | 279 | "## Caching\n", |
280 | | - "As illustrated above, reasoning models produce both reasoning tokens and completion tokens that are treatead different in the API today. This also has implications for cache utilization and latency. To illustrate the point, we include this helpful sketch.\n", |
| 280 | + "As illustrated above, reasoning models produce both reasoning tokens and completion tokens that are treated differently in the API today. This also has implications for cache utilization and latency. To illustrate the point, we include this helpful sketch.\n", |
281 | 281 | "\n", |
282 | | - "\n", |
283 | | - "\n" |
| 282 | + "" |
284 | 283 | ] |
285 | 284 | }, |
286 | 285 | { |
287 | 286 | "cell_type": "markdown", |
288 | 287 | "metadata": {}, |
289 | 288 | "source": [ |
290 | | - "Note that in turn 2, reasoning items from turn 1 will be ignored and stripped since the model does not reuse reasoning items from previous turns, which is why it is impossible to get a full cache hit on the fourth API call in the diagram above as the prompt now exclude the reasoning items. That being said, we can still include them without harm as the API will strip reasoning items that are irrelevant in the current turn automatically. Keep in mind that cacheing will only become relevant for prompts that are longer than 1024 tokens in length. In our tests, we were able to get cache utilization to go from 40% to 80% of the input prompt by moving from Completions to Responses API. With better cache utilization comes better economics as cached tokens get billed significantly less than uncached ones: for `o4-mini`, cached input tokens are 75% cheaper than uncached input tokens. It will also improve latency as well. " |
| 289 | + "Note that in turn 2, reasoning items from turn 1 will be ignored and stripped, since the model does not reuse reasoning items from previous turns. This is why it is impossible to get a full cache hit on the fourth API call in the diagram above, as the prompt now excludes the reasoning items. That being said, we can still include them without harm, as the API will automatically strip reasoning items that are irrelevant in the current turn. Keep in mind that caching only becomes relevant for prompts longer than 1024 tokens. In our tests, we were able to get cache utilization to go from 40% to 80% of the input prompt by moving from the Completions API to the Responses API. With better cache utilization comes better economics, as cached tokens are billed significantly less than uncached ones: for `o4-mini`, cached input tokens are 75% cheaper than uncached input tokens. Better cache utilization also reduces latency."
291 | 290 | ] |
292 | 291 | }, |
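The cache-utilization figure quoted above can be computed directly from a response's usage block, which reports cached input tokens under `usage.input_tokens_details.cached_tokens`. The token counts in this sketch are illustrative, not from a real call.

```python
# Minimal sketch: fraction of the input prompt served from the prompt cache,
# computed from a usage-shaped dict (token counts here are illustrative).

def cache_utilization(usage: dict) -> float:
    """Fraction of input tokens that were billed at the cached rate."""
    if not usage["input_tokens"]:
        return 0.0
    return usage["input_tokens_details"]["cached_tokens"] / usage["input_tokens"]

usage = {"input_tokens": 2000, "input_tokens_details": {"cached_tokens": 1600}}
print(f"{cache_utilization(usage):.0%}")
# → 80%
```

Tracking this ratio across turns is a quick way to check whether your prompt construction is preserving the cacheable prefix.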
293 | 292 | { |
|
296 | 295 | "source": [ |
297 | 296 | "## Encrypted Reasoning Items\n", |
298 | 297 | "\n", |
299 | | - "For organizations who cannot use Responses API in a stateful way due to compliance and data requirement constraints (e.g if your organization is under [Zero Data Retention](https://openai.com/enterprise-privacy/)), we've recently rolled out [encrypted reasoning items](https://platform.openai.com/docs/guides/reasoning?api-mode=responses#encrypted-reasoning-items) which allows you reap all the benefits mentioned above while continuing the use the responses API in a stateless way.\n", |
| 298 | + "For organizations that cannot use the Responses API in a stateful way due to compliance and data requirement constraints (e.g., if your organization is under [Zero Data Retention](https://openai.com/enterprise-privacy/)), we've recently rolled out [encrypted reasoning items](https://platform.openai.com/docs/guides/reasoning?api-mode=responses#encrypted-reasoning-items), which allow you to reap all the benefits mentioned above while continuing to use the Responses API in a stateless way.\n", |
300 | 299 | "\n", |
301 | | - "To leverage this, all you have to do is to include `[ \"reasoning.encrypted_content\" ]` as a part of the `include` field. By doing so we will pass an encrypted version of the reaasoning tokens to you that you can then pass back just like how you pass back reasoning items before.\n", |
| 300 | + "To leverage this, all you have to do is include `[\"reasoning.encrypted_content\"]` as part of the `include` field. By doing so, we will pass an encrypted version of the reasoning tokens to you, which you can then pass back just as you would with reasoning items before.\n", |
302 | 301 | "\n", |
303 | | - "If your org is under Zero Data Retention (ZDR), OpenAI automatically enforces `store=false` settings at the API level. When a user’s request comes into the Responses API, we first check for any `encrypted_content` included in the payload. If present, this content is decrypted in-memory using keys to which only OpenAI has access. This decrypted reasoning content (i.e., chain-of-thought) is never written to disk and is used solely to inform the model’s next response. Once the model generates its output, any new reasoning tokens it produces are immediately encrypted and returned to the client as part of the response payload. At that point, all transient data from the request—including both decrypted inputs and model outputs—is securely discarded. No intermediate state is persisted to disk, ensuring full compliance with ZDR.\n", |
| 302 | + "If your organization is under Zero Data Retention (ZDR), OpenAI automatically enforces `store=false` settings at the API level. When a user’s request comes into the Responses API, we first check for any `encrypted_content` included in the payload. If present, this content is decrypted in-memory using keys to which only OpenAI has access. This decrypted reasoning content (i.e., chain-of-thought) is never written to disk and is used solely to inform the model’s next response. Once the model generates its output, any new reasoning tokens it produces are immediately encrypted and returned to the client as part of the response payload. At that point, all transient data from the request—including both decrypted inputs and model outputs—is securely discarded. No intermediate state is persisted to disk, ensuring full compliance with ZDR.\n", |
304 | 303 | "\n", |
305 | | - "Here is a quick modified version of the above code snippet to demonstrate this" |
| 304 | + "Here is a quick modified version of the above code snippet to demonstrate this:" |
306 | 305 | ] |
307 | 306 | }, |
308 | 307 | { |
|
344 | 343 | "cell_type": "markdown", |
345 | 344 | "metadata": {}, |
346 | 345 | "source": [ |
347 | | - "Wtih `include=[\"reasoning.encrypted_content\"]` set, we now see a `encrypted_content` field in the reasoning item being passed back, this encrypted content represent the model's reasoning state. persisted entirely on the client side with OpenAI retaining no data. We can then pass this back like how we did with the reasoning item like before." |
| 346 | + "With `include=[\"reasoning.encrypted_content\"]` set, we now see an `encrypted_content` field in the reasoning item being passed back. This encrypted content represents the model's reasoning state, persisted entirely on the client side with OpenAI retaining no data. We can then pass this back just as we did with the reasoning item before." |
348 | 347 | ] |
349 | 348 | }, |
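One practical consequence of stateless use is worth sketching: when `store=False`, a reasoning item's ID alone is useless on the next request, so only items carrying `encrypted_content` are worth passing back. The guard below is a hypothetical client-side check, not part of the SDK, and the stub items (IDs, truncated ciphertext) are illustrative; with the real SDK you would request the payload via `include=["reasoning.encrypted_content"]` and filter `response.output` the same way.

```python
# Minimal sketch: a hypothetical client-side filter for stateless reuse.
# Reasoning items without an encrypted payload are dropped, since no state
# is stored server-side to resolve a bare ID.

def reusable_context(output_items: list) -> list:
    """Keep only items that can be passed back when store=False."""
    kept = []
    for item in output_items:
        if item["type"] == "reasoning" and not item.get("encrypted_content"):
            continue  # a bare ID cannot be resolved statelessly
        kept.append(item)
    return kept

# Illustrative stub of response.output items (IDs and ciphertext are fake).
output_items = [
    {"type": "reasoning", "id": "rs_1", "encrypted_content": "gAAAAB..."},
    {"type": "reasoning", "id": "rs_2"},  # no encrypted payload: dropped
    {"type": "message", "id": "msg_1"},
]

print([item["id"] for item in reusable_context(output_items)])
# → ['rs_1', 'msg_1']
```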
350 | 349 | { |
|
390 | 389 | "cell_type": "markdown", |
391 | 390 | "metadata": {}, |
392 | 391 | "source": [ |
393 | | - "With a simple change to the `include` field, we can now pass back the encrypted reasoning item and use it to improve the model's performance in intelligence, cost and latency.\n", |
| 392 | + "With a simple change to the `include` field, we can now pass back the encrypted reasoning item and use it to improve the model's performance in intelligence, cost, and latency.\n", |
394 | 393 | "\n", |
395 | | - "Now you should be fully equipped with knowledge to be able to fully utilize our latest reasoning models!" |
| 394 | + "Now you should be fully equipped with the knowledge to fully utilize our latest reasoning models!" |
396 | 395 | ] |
397 | 396 | } |
398 | 397 | ], |
|