
Commit 2f31fb1

Merge branch 'main' of https://github.com/openai/openai-cookbook into alowden/fine_tuning_techniques_dpo

2 parents: 860de79 + 6bccfd9

21 files changed: +1157 −35 lines

examples/Reinforcement_Fine_Tuning.ipynb

Lines changed: 6 additions & 6 deletions
@@ -1339,8 +1339,8 @@
 "outputs": [],
 "source": [
 "# Set your training and test file paths\n",
-"train_file = \"data/medical_01_verifiable_problem_train_with_prompt.jsonl\"\n",
-"test_file = \"data/medical_01_verifiable_problem_val_with_prompt.jsonl\"\n",
+"train_file = \"data/medical_01_verifiable_problem_train_simple_prompt.jsonl\"\n",
+"test_file = \"data/medical_01_verifiable_problem_val_simple_prompt.jsonl\"\n",
 "\n",
 "def upload_file(file_path: str) -> str:\n",
 "    \"\"\"Upload a file to the OpenAI platform for fine-tuning.\"\"\"\n",

@@ -1389,7 +1389,7 @@
 "grader = model_grader_2\n",
 "response_format = None\n",
 "compute_multiplier = 1.0\n",
-"etest_samples = 1\n",
+"eval_samples = 1\n",
 "eval_interval = 5"
 ]
 },

@@ -1409,7 +1409,7 @@
 "# Launch the RFT job\n",
 "payload = dict(\n",
 "    training_file=train_file_id,\n",
-"    test_file=test_file_id,\n",
+"    validation_file=test_file_id,\n",
 "    model=model,\n",
 "    suffix=suffix,\n",
 "    method=dict(\n",

@@ -1419,7 +1419,7 @@
 "        response_format=response_format,\n",
 "        hyperparameters=dict(\n",
 "            compute_multiplier=compute_multiplier,\n",
-"            etest_samples=etest_samples,\n",
+"            eval_samples=eval_samples,\n",
 "            eval_interval=eval_interval,\n",
 "            n_epochs=n_epochs,\n",
 "            reasoning_effort=reasoning_effort,\n",

@@ -2116,7 +2116,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.11.8"
+"version": "3.12.9"
 }
 },
 "nbformat": 4,
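For readers following the renames in this file, the corrected payload can be sanity-checked offline before any API call is made. This is a minimal sketch assuming illustrative file IDs and a simplified `method` nesting; the grader stub, suffix, and model name are placeholders for illustration, not values taken from the notebook.

```python
# Sketch: assemble the RFT job payload with the corrected field names.
# train_file_id / test_file_id are hypothetical placeholders for IDs
# returned by the file-upload step; grader and model are also placeholders.
train_file_id = "file-abc123"
test_file_id = "file-def456"

payload = dict(
    training_file=train_file_id,
    validation_file=test_file_id,  # was `test_file` before the fix
    model="o4-mini",               # placeholder model name
    suffix="medical-rft",          # placeholder suffix
    method=dict(
        type="reinforcement",
        reinforcement=dict(
            grader=None,           # placeholder for model_grader_2
            response_format=None,
            hyperparameters=dict(
                compute_multiplier=1.0,
                eval_samples=1,    # was `etest_samples` before the fix
                eval_interval=5,
            ),
        ),
    ),
)

# Check that none of the old (misspelled/renamed) keys survive.
hp = payload["method"]["reinforcement"]["hyperparameters"]
assert "validation_file" in payload and "test_file" not in payload
assert "eval_samples" in hp and "etest_samples" not in hp
print("payload fields OK")
```

A check like this catches silent misspellings locally, rather than waiting for the jobs endpoint to reject the request.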

examples/evaluation/use-cases/mcp_eval_notebook.ipynb

Lines changed: 26 additions & 0 deletions
@@ -450,6 +450,8 @@
 "id": "ee1f655b",
 "metadata": {},
 "source": [
+"Note that the 4.1 model was constructed to never use its tools to answer the query, so it never called the MCP server. The o4-mini model wasn't explicitly instructed to use its tools either, but it wasn't forbidden from doing so, and it called the MCP server 3 times. We can see that the 4.1 model performed worse than the o4-mini model. Also notable: the one example the o4-mini model failed was one where the MCP tool was not used.\n",
+"\n",
 "We can also check a detailed analysis of the outputs from each model for manual inspection and further analysis."
 ]
 },

@@ -806,6 +808,30 @@
 "    print(item.sample.output[0].content)"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "0936def6",
+"metadata": {},
+"source": [
+"## How can we improve?\n",
+"\n",
+"If we add the phrase \"Always use your tools since they are the way to get the right answer in this task.\" to the system message of the o4-mini model, what do you think will happen? (Try it out!)\n",
+"\n",
+"<br><br><br>\n",
+"\n",
+"If you guessed that the model would now call the MCP tool every time and get every answer correct, you are right!"
+]
+},
+{
+"cell_type": "markdown",
+"id": "cf797a91",
+"metadata": {},
+"source": [
+"![Evaluation Data Tab](../../../images/mcp_eval_improved_output.png)\n",
+"![Evaluation Data Tab](../../../images/mcp_eval_improved_data.png)"
+]
+},
 {
 "cell_type": "markdown",
 "id": "924619e0",
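The suggested improvement amounts to a plain message-list edit before re-running the eval. In this sketch, the base system prompt and the user question are hypothetical; only the quoted nudge sentence comes from the notebook cell added above.

```python
# Sketch: nudge o4-mini toward always calling its MCP tools by extending
# the system message. The base prompt here is a hypothetical placeholder.
base_system_prompt = "You answer questions about the repository."  # hypothetical

tool_nudge = (
    "Always use your tools since they are the way to get the right answer "
    "in this task."
)

system_prompt = f"{base_system_prompt} {tool_nudge}"

# Messages in the standard chat format, ready to pass to the eval run.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "How do I run the eval notebook?"},  # hypothetical
]

print(messages[0]["content"])
```

Keeping the nudge as a separate string makes it easy to A/B the eval with and without it.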

examples/o-series/o3o4-mini_prompting_guide.ipynb

Lines changed: 12 additions & 0 deletions
@@ -163,6 +163,18 @@
 "Validate arguments against the format before sending the call; if you are unsure, ask for clarification instead of guessing.\n",
 "```\n",
 "\n",
+"3. Another note on lazy behavior\n",
+"We are aware of rare instances of lazy behavior from o3, such as stating it does not have enough time to complete a task, promising to follow up separately, or giving terse answers even when explicitly prompted to provide more detail. We have found that the following steps help ameliorate this behavior:\n",
+"\n",
+"    a. Start a new conversation for unrelated topics:\n",
+"    When switching to a new or unrelated topic, begin a fresh conversation thread rather than continuing in the same context. This helps the model focus on the current subject and prevents it from being influenced by previous, irrelevant context, which can sometimes lead to incomplete or lazy responses. For example, if you were previously discussing code debugging and now want to ask about documentation best practices, which does not require previous conversation context, start a new conversation to ensure clarity and focus.\n",
+"\n",
+"    b. Discard irrelevant past tool calls/outputs when the list gets too long, and summarize them as context in the user message:\n",
+"    If the conversation history contains a long list of previous tool calls or outputs that are no longer relevant, remove them from the context. Instead, provide a concise summary of the important information as part of the user message. This keeps the context manageable and ensures the model has access to only the most pertinent information. For instance, if you have a lengthy sequence of tool outputs, you can summarize the key results and include only that summary in your next message.\n",
+"\n",
+"    c. We are constantly improving our models and expect to have this issue addressed in future versions.\n",
+"\n",
 "### Avoid Chain of Thought Prompting\n",
 "Since these models are reasoning models and produce an internal chain of thought, they do not have to be explicitly prompted to plan and reason between toolcalls. Therefore, a developer should not try to induce additional reasoning before each function call by asking the model to plan more extensively. Asking a reasoning model to reason more may actually hurt the performance. \n",
 "\n",
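Step (b) above can be sketched as a small history-trimming helper. The message shapes, the `trim_tool_history` name, and the one-line summarizer are assumptions for illustration, not an SDK API.

```python
# Sketch of step (b): drop stale tool messages from a long history and
# replace them with a short summary appended as user-message context.

def trim_tool_history(messages, keep_last=2):
    """Keep only the most recent `keep_last` tool messages; summarize the rest."""
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    stale = tool_msgs[:-keep_last] if keep_last else tool_msgs
    stale_ids = {id(m) for m in stale}

    trimmed = [m for m in messages if id(m) not in stale_ids]
    if stale:
        # Naive summarizer: first 40 chars of each stale output, joined.
        summary = "Summary of earlier tool results: " + "; ".join(
            m["content"][:40] for m in stale
        )
        trimmed.append({"role": "user", "content": summary})
    return trimmed

history = [
    {"role": "user", "content": "Debug this function."},
    {"role": "tool", "content": "lint: 3 warnings"},
    {"role": "tool", "content": "tests: 12 passed"},
    {"role": "tool", "content": "coverage: 87%"},
    {"role": "user", "content": "Now write the docs."},
]

slim = trim_tool_history(history, keep_last=1)
print([m["role"] for m in slim])  # ['user', 'tool', 'user', 'user']
```

In practice the summary would be written (or generated) with more care, but the shape of the transformation — prune, then restate — is the point.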

examples/partners/eval_driven_system_design/receipt_inspection.ipynb

Lines changed: 3 additions & 12 deletions
@@ -139,18 +139,9 @@
 "\n",
 "### 2. Assemble Examples (Gather Data)\n",
 "\n",
-"It's very rare that a real-world project will start with all the data necessary to get\n",
-"to a satisfactory solution, much less to establish confidence.\n",
-"\n",
-"In our case, we're going to assume that we have a decent sample of system *inputs*, \n",
-"in the form of but receipt images, but start without any fully annotated data. We find \n",
-"this is a not-unusual situation when automating an existing process. Instead, \n",
-"we'll walk through the process of building that out as we go along by collaborating with\n",
-"domain experts, and make our evals progressively more comprehensive.\n",
-"In our case, we're going to assume that we have a decent sample of system *inputs*\n",
-"(here, photographs of receipts), but start without any fully annotated data. We'll walk\n",
-"through the process of incrementally expanding our test and training sets as we go along\n",
-"and make our evals progressively more comprehensive.\n",
+"It's very rare for a real-world project to begin with all the data necessary to achieve a satisfactory solution, let alone establish confidence.\n",
+"\n",
+"In our case, we'll assume we have a decent sample of system *inputs*, in the form of receipt images, but start without any fully annotated data. We find this is a not-unusual situation when automating an existing process. We'll walk through the process of incrementally expanding our test and training sets in collaboration with domain experts as we go along and make our evals progressively more comprehensive.\n",
 "\n",
 "### 3. Build an End-to-End V0 System\n",
 "\n",
20 KB binary file (not shown)
