Commit 58c9834

Merge branch 'main' into agents-sdk-pm-advisor-cookbook

2 parents 9c93271 + a140048

4 files changed: +71 -41 lines changed


authors.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -17,7 +17,7 @@ prashantmital-openai:
   website: "https://www.linkedin.com/in/pmital/"
   avatar: "https://avatars.githubusercontent.com/u/173949238?v=4"
 
-theophile-openai:
+theophile-oai:
   name: "Theophile Sautory"
   website: "https://www.linkedin.com/in/theophilesautory"
   avatar: "https://avatars.githubusercontent.com/u/206768658?v=4"
```

examples/Reinforcement_Fine_Tuning.ipynb

Lines changed: 22 additions & 33 deletions

```diff
@@ -689,7 +689,7 @@
     "\n",
     "We can visualize the full score distribution on the training set.\n",
     "\n",
-    "> **Note:** : In practice, analyzing model errors at scale often involves a mix of manual review and automated methods, like tagging failure types or clustering predictions by score and content. That workflow is beyond the scope of this guide, but it's a valuable next step once you've identified broad patterns."
+    "> Note: In practice, analyzing model errors at scale often involves a mix of manual review and automated methods, like tagging failure types or clustering predictions by score and content. That workflow is beyond the scope of this guide, but it's a valuable next step once you've identified broad patterns."
    ]
   },
   {
```
```diff
@@ -1968,7 +1968,7 @@
    "source": [
     "Looking at the distribution of scores, we observe that RFT helped shift the model’s predictions out of the mid-to-low score zone (0.4–0.5) and into the mid-to-high range (0.5–0.6). Since the grader emphasizes clinical similarity over lexical match, this shift reflects stronger medical reasoning, not just better phrasing, according to our *expert* grader. As observed in the 0.9–1.0 range, some verbosity crept in despite mitigations, slightly lowering scores throughout, though it often reflected more complete, semantically aligned answers. A future grader pass could better account for these cases.\n",
     "\n",
-    "Note, because the earlier `combined_grader` was designed to reward lexical correctness, its accuracy didnʼt improve much, which is expected. That gap reinforces why validating your model grader is critical, and why you should monitor for reward-hacking. In our case, we used `o3` to spot-check grading behavior, but domain expert review is essential. "
+    "Note that, because the earlier `combined_grader` was designed to reward lexical correctness, its accuracy didnʼt improve much, which is expected. That gap reinforces why validating your model grader is critical, and why you should monitor for reward-hacking. In our case, we used `o3` to spot-check grading behavior, but domain expert review is essential. "
    ]
   },
   {
```
```diff
@@ -2019,22 +2019,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/markdown": [
-       "**Classifying staging type**\n",
-       "\n",
-       "The user provided a clinical scenario of a 35-year-old female with a 5 cm oral tumor and a 2 cm lymph node. They're asking how to stage it according to the TNM classification. This is a diagnosis query, so the correct answer type here is \"diagnosis.\" Considering the tumor's size, it appears to be classified as T3 since it's greater than 4 cm. Thus, I think the staging might be Stage II, but I'll confirm that."
-      ],
-      "text/plain": [
-       "<IPython.core.display.Markdown object>"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Classifying staging type\n",
+      "\n",
+      "The user provided a clinical scenario of a 35-year-old female with a 5 cm oral tumor and a 2 cm lymph node. They're asking how to stage it according to the TNM classification. This is a diagnosis query, so the correct answer type here is \"diagnosis.\" Considering the tumor's size, it appears to be classified as T3 since it's greater than 4 cm. Thus, I think the staging might be Stage II, but I'll confirm that.\n"
+     ]
     }
    ],
    "source": [
```
```diff
@@ -2045,27 +2040,21 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 2,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/markdown": [
-       "**Clarifying T staging for cancers**\n",
-       "\n",
-       "I’m digging into T staging for head and neck cancers in the oral cavity. So, T1 applies to tumors 2 cm or less, T2 for those over 2 cm but not more than 4 cm, and T3 is for tumors over 4 cm. T4a indicates invasion into adjacent structures. The patient's tumor measures 5 cm, which is over 4 cm. I’m not sure if it fits T3 or T4a, since T4a involves additional invasiveness, not just size.\n",
-       "**Determining T and N staging**\n",
-       "\n",
-       "I’m looking at a 5 cm tumor in the oral cavity. It seems there’s no mention of invasion into adjacent structures, so I’m categorizing it as T3 due to its size. T4a usually means invasion into structures like bone or skin. According to the TNM classification, since I see no such invasion, T classification remains T3.\n",
-       "\n",
-       "Moving on to N staging, I see there's a single lymph node of 2 cm on the same side; this fits the N1 classification for metastasis, as it’s less than 3 cm."
-      ],
-      "text/plain": [
-       "<IPython.core.display.Markdown object>"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Clarifying T staging for cancers\n",
+      "\n",
+      "I’m digging into T staging for head and neck cancers in the oral cavity. So, T1 applies to tumors 2 cm or less, T2 for those over 2 cm but not more than 4 cm, and T3 is for tumors over 4 cm. T4a indicates invasion into adjacent structures. The patient's tumor measures 5 cm, which is over 4 cm. I’m not sure if it fits T3 or T4a, since T4a involves additional invasiveness, not just size. Determining T and N staging\n",
+      "\n",
+      "I’m looking at a 5 cm tumor in the oral cavity. It seems there’s no mention of invasion into adjacent structures, so I’m categorizing it as T3 due to its size. T4a usually means invasion into structures like bone or skin. According to the TNM classification, since I see no such invasion, T classification remains T3.\n",
+      "\n",
+      "Moving on to N staging, I see there's a single lymph node of 2 cm on the same side; this fits the N1 classification for metastasis, as it’s less than 3 cm.\n"
+     ]
     }
    ],
    "source": [
```

examples/o-series/o3o4-mini_prompting_guide.ipynb

Lines changed: 47 additions & 6 deletions

```diff
@@ -176,10 +176,28 @@
     "# Responses API\n",
     "\n",
     "### Reasoning Items for Better Performance\n",
-    "We’ve released a [cookbook](https://cookbook.openai.com/examples/responses_api/reasoning_items) detailing the benefits of using the Responses API. It is worth restating a few of the main points in this guide as well. o3/o4-mini are both trained with their internal reasoning persisted between tool calls within a single turn. Persisting these reasoning items between tool calls during inference will therefore lead to higher intelligence and performance in the form of better decisions about when and how a tool gets called. Responses allows you to persist these reasoning items (maintained either by us, or by yourself through encrypted content if you do not want us to handle state management), while Chat Completions doesn’t. Switching to the Responses API and allowing the model access to reasoning items between function calls is the easiest way to squeeze out as much performance as possible from function calls. Here is the example from the cookbook, reproduced for convenience, showing how you can pass back the reasoning item using `encrypted_content` in a way where we do not retain any state on our end\n",
-    "\n",
-    "```\n",
+    "We’ve released a [cookbook](https://cookbook.openai.com/examples/responses_api/reasoning_items) detailing the benefits of using the Responses API. It is worth restating a few of the main points in this guide as well. o3/o4-mini are both trained with their internal reasoning persisted between tool calls within a single turn. Persisting these reasoning items between tool calls during inference will therefore lead to higher intelligence and performance in the form of better decisions about when and how a tool gets called. Responses allows you to persist these reasoning items (maintained either by us, or by yourself through encrypted content if you do not want us to handle state management), while Chat Completions doesn’t. Switching to the Responses API and allowing the model access to reasoning items between function calls is the easiest way to squeeze out as much performance as possible from function calls. Here is the example from the cookbook, reproduced for convenience, showing how you can pass back the reasoning item using `encrypted_content` in a way where we do not retain any state on our end:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The current temperature in Paris is about 18.8 °C.\n"
+     ]
+    }
+   ],
+   "source": [
+    "from openai import OpenAI\n",
     "import requests\n",
+    "import json\n",
+    "client = OpenAI()\n",
+    "\n",
     "\n",
     "def get_weather(latitude, longitude):\n",
     "    response = requests.get(f\"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m,wind_speed_10m&hourly=temperature_2m,relative_humidity_2m,wind_speed_10m\")\n",
```
```diff
@@ -206,14 +224,37 @@
     "context = [{\"role\": \"user\", \"content\": \"What's the weather like in Paris today?\"}]\n",
     "\n",
     "response = client.responses.create(\n",
-    "    model=\"o4-mini\",\n",
+    "    model=\"o3\",\n",
     "    input=context,\n",
     "    tools=tools,\n",
+    "    store=False,\n",
+    "    include=[\"reasoning.encrypted_content\"]  # Encrypted chain of thought is passed back in the response\n",
     ")\n",
     "\n",
     "\n",
-    "response.output\n",
-    "```\n"
+    "context += response.output  # Add the response to the context (including the encrypted chain of thought)\n",
+    "tool_call = response.output[1]\n",
+    "args = json.loads(tool_call.arguments)\n",
+    "\n",
+    "\n",
+    "\n",
+    "result = get_weather(args[\"latitude\"], args[\"longitude\"])\n",
+    "\n",
+    "context.append({\n",
+    "    \"type\": \"function_call_output\",\n",
+    "    \"call_id\": tool_call.call_id,\n",
+    "    \"output\": str(result)\n",
+    "})\n",
+    "\n",
+    "response_2 = client.responses.create(\n",
+    "    model=\"o3\",\n",
+    "    input=context,\n",
+    "    tools=tools,\n",
+    "    store=False,\n",
+    "    include=[\"reasoning.encrypted_content\"]\n",
+    ")\n",
+    "\n",
+    "print(response_2.output_text)"
    ]
   },
   {
```
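
The hunk above calls `client.responses.create(..., tools=tools)` and `get_weather(...)`, but the `tools` definition itself sits earlier in the notebook and is not part of this diff. For readers skimming the commit, here is a sketch of what such a definition typically looks like for a Responses API function tool; the description string and the `strict`/`additionalProperties` settings are assumptions, not taken from the notebook.

```python
# Hypothetical function-tool schema matching the get_weather(latitude, longitude) helper in the diff.
tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current temperature (Celsius) for the given coordinates.",
        "parameters": {
            "type": "object",
            "properties": {
                "latitude": {"type": "number"},
                "longitude": {"type": "number"},
            },
            "required": ["latitude", "longitude"],
            "additionalProperties": False,
        },
        "strict": True,
    }
]
```

With `store=False` and `include=["reasoning.encrypted_content"]`, the encrypted reasoning item comes back in `response.output`, and it is replayed to the model simply by appending that output to the next request's `input`, which is what the added lines in the hunk do.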

registry.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -33,7 +33,7 @@
   path: examples/Reinforcement_Fine_Tuning.ipynb
   date: 2025-05-23
   authors:
-    - theophile-openai
+    - theophile-oai
   tags:
     - reinforcement-learning
     - fine-tuning
```
