|
6 | 6 | "source": [
|
7 | 7 | "# Optimize Prompts\n",
|
8 | 8 | "\n",
|
9 |
| - "This cookbook provides a look into an early version of OpenAI's prompt optimization system. Crafting effective prompts is a critical skill when working with AI models. Even experienced users can inadvertently introduce contradictions, ambiguities, or inconsistencies that lead to suboptimal results. The system demonstrated here helps identify and fix common issues, resulting in more reliable and effective prompts.\n", |
| 9 | + "Crafting effective prompts is a critical skill when working with AI models. Even experienced users can inadvertently introduce contradictions, ambiguities, or inconsistencies that lead to suboptimal results. The system demonstrated here helps identify and fix common issues, resulting in more reliable and effective prompts.\n", |
10 | 10 | "\n",
|
11 | 11 | "The optimization process uses a multi-agent approach with specialized AI agents collaborating to analyze and rewrite prompts. The system automatically identifies and addresses several types of common issues:\n",
|
12 | 12 | "\n",
|
|
16 | 16 | "\n",
|
17 | 17 | "---\n",
|
18 | 18 | "\n",
|
19 |
| - "**Objective**: This notebook demonstrates best practices for creating a useful and robust agent system and helps you develop more effective prompts for your applications.\n", |
| 19 | + "**Objective**: This cookbook demonstrates best practices for using Agents SDK together with Evals to build an early version of OpenAI's prompt optimization system. You can optimize your prompt using this code or use the optimizer [in our playground!](https://platform.openai.com/playground/prompts)\n", |
| 20 | + "\n", |
| 21 | + "\n", |
| 22 | + "\n", |
| 23 | + "\n", |
| 24 | + "\n", |
| 25 | + "\n", |
| 26 | + "\n", |
| 27 | + "\n", |
| 28 | + "Ask ChatGPT\n", |
| 29 | + "\n", |
20 | 30 | "\n",
|
21 | 31 | "**Cookbook Structure** \n",
|
22 | 32 | "This notebook follows this structure:\n",
|
23 | 33 | "\n",
|
24 | 34 | "- [Step 1. System Overview](#1-system-overview) - Learn how the prompt optimization system works \n",
|
25 | 35 | "- [Step 2. Data Models](#2-data-models) - Understand the data structures used by the system\n",
|
26 | 36 | "- [Step 3. Defining the Agents](#3-defining-the-agents) - Look at agents that analyze and improve prompts\n",
|
27 |
| - "- [Step 4. Run Optimization Workflow](#4-run-optimization-workflow) - See how the workflow hands off the prompts\n", |
28 |
| - "- [Step 5. Examples](#5-examples) - Explore real-world examples of prompt optimization\n", |
| 37 | + "- [Step 4. Evaluations](#4-using-evaluations-to-arrive-at-these-agents) - Use Evals to verify our agent model choice and instructions\n", |
| 38 | + "- [Step 5. Run Optimization Workflow](#4-run-optimization-workflow) - See how the workflow hands off the prompts\n", |
| 39 | + "- [Step 6. Examples](#5-examples) - Explore real-world examples of prompt optimization\n", |
29 | 40 | "\n",
|
30 | 41 | "**Prerequisites**\n",
|
31 | 42 | "- The `openai` Python package \n",
|
|
173 | 184 | },
|
174 | 185 | {
|
175 | 186 | "cell_type": "code",
|
176 |
| - "execution_count": 7, |
| 187 | + "execution_count": null, |
177 | 188 | "metadata": {},
|
178 | 189 | "outputs": [],
|
179 | 190 | "source": [
|
|
182 | 193 | " model=\"gpt-4.1\",\n",
|
183 | 194 | " output_type=Issues,\n",
|
184 | 195 | " instructions=\"\"\"\n",
|
185 |
| - " You are **Dev-Contradiction-Checker-v2**.\n", |
| 196 | + " You are **Dev-Contradiction-Checker**.\n", |
186 | 197 | "\n",
|
187 | 198 | " Goal\n",
|
188 | 199 | " Detect *genuine* self-contradictions or impossibilities **inside** the developer prompt supplied in the variable `DEVELOPER_MESSAGE`.\n",
|
|
215 | 226 | " model=\"gpt-4.1\",\n",
|
216 | 227 | " output_type=Issues,\n",
|
217 | 228 | " instructions=\"\"\"\n",
|
218 |
| - " You are Format-Checker-v2.\n", |
| 229 | + " You are Format-Checker.\n", |
219 | 230 | "\n",
|
220 | 231 | " Task\n",
|
221 | 232 | " Decide whether the developer prompt requires a structured output (JSON/CSV/XML/Markdown table, etc.).\n",
|
|
246 | 257 | " model=\"gpt-4.1\",\n",
|
247 | 258 | " output_type=FewShotIssues,\n",
|
248 | 259 | " instructions=\"\"\"\n",
|
249 |
| - " You are FewShot-Consistency-Checker-v3.\n", |
| 260 | + " You are FewShot-Consistency-Checker.\n", |
250 | 261 | "\n",
|
251 | 262 | " Goal\n",
|
252 | 263 | " Find conflicts between the DEVELOPER_MESSAGE rules and the accompanying **assistant** examples.\n",
|
|
306 | 317 | " model=\"gpt-4.1\",\n",
|
307 | 318 | " output_type=DevRewriteOutput,\n",
|
308 | 319 | " instructions=\"\"\"\n",
|
309 |
| - " You are Dev-Rewriter-v2.\n", |
| 320 | + " You are Dev-Rewriter.\n", |
310 | 321 | "\n",
|
311 | 322 | " You receive:\n",
|
312 | 323 | " - ORIGINAL_DEVELOPER_MESSAGE\n",
|
|
338 | 349 | " model=\"gpt-4.1\",\n",
|
339 | 350 | " output_type=MessagesOutput,\n",
|
340 | 351 | " instructions=\"\"\"\n",
|
341 |
| - " You are FewShot-Rewriter-v2.\n", |
| 352 | + " You are FewShot-Rewriter.\n", |
342 | 353 | "\n",
|
343 | 354 | " Input payload\n",
|
344 | 355 | " - NEW_DEVELOPER_MESSAGE (already optimized)\n",
|
|
373 | 384 | "cell_type": "markdown",
|
374 | 385 | "metadata": {},
|
375 | 386 | "source": [
|
376 |
| - "## 4. Run Optimization Workflow\n", |
| 387 | + "## 4. Using Evaluations to Arrive at these Agents\n", |
| 388 | + "\n", |
| 389 | + "Let's see how we used our Evals to tune agent prompts + pick models. We constructed a set of golden examples: each one contains original messages (developer message + user/assistant message) and the changes our optimization workflow should make. Here are two example of golden pairs that we used:" |
| 390 | + ] |
| 391 | + }, |
| 392 | + { |
| 393 | + "cell_type": "code", |
| 394 | + "execution_count": null, |
| 395 | + "metadata": { |
| 396 | + "vscode": { |
| 397 | + "languageId": "javascript" |
| 398 | + } |
| 399 | + }, |
| 400 | + "outputs": [], |
| 401 | + "source": [ |
| 402 | + "[\n", |
| 403 | + " {\n", |
| 404 | + " \"focus\": \"contradiction_issues\",\n", |
| 405 | + " \"input_payload\": {\n", |
| 406 | + " \"developer_message\": \"Always answer in **English**.\\nNunca respondas en inglés.\",\n", |
| 407 | + " \"messages\": [\n", |
| 408 | + " {\n", |
| 409 | + " \"role\": \"user\",\n", |
| 410 | + " \"content\": \"¿Qué hora es?\"\n", |
| 411 | + " }\n", |
| 412 | + " ]\n", |
| 413 | + " },\n", |
| 414 | + " \"golden_output\": {\n", |
| 415 | + " \"changes\": true,\n", |
| 416 | + " \"new_developer_message\": \"Always answer **in English**.\",\n", |
| 417 | + " \"new_messages\": [\n", |
| 418 | + " {\n", |
| 419 | + " \"role\": \"user\",\n", |
| 420 | + " \"content\": \"¿Qué hora es?\"\n", |
| 421 | + " }\n", |
| 422 | + " ],\n", |
| 423 | + " \"contradiction_issues\": \"Developer message simultaneously insists on English and forbids it.\",\n", |
| 424 | + " \"few_shot_contradiction_issues\": \"\",\n", |
| 425 | + " \"format_issues\": \"\",\n", |
| 426 | + " \"general_improvements\": \"\"\n", |
| 427 | + " }\n", |
| 428 | + " },\n", |
| 429 | + " {\n", |
| 430 | + " \"focus\": \"few_shot_contradiction_issues\",\n", |
| 431 | + " \"input_payload\": {\n", |
| 432 | + " \"developer_message\": \"Respond with **only 'yes' or 'no'** – no explanations.\",\n", |
| 433 | + " \"messages\": [\n", |
| 434 | + " {\n", |
| 435 | + " \"role\": \"user\",\n", |
| 436 | + " \"content\": \"Is the sky blue?\"\n", |
| 437 | + " },\n", |
| 438 | + " {\n", |
| 439 | + " \"role\": \"assistant\",\n", |
| 440 | + " \"content\": \"Yes, because wavelengths …\"\n", |
| 441 | + " },\n", |
| 442 | + " {\n", |
| 443 | + " \"role\": \"user\",\n", |
| 444 | + " \"content\": \"Is water wet?\"\n", |
| 445 | + " },\n", |
| 446 | + " {\n", |
| 447 | + " \"role\": \"assistant\",\n", |
| 448 | + " \"content\": \"Yes.\"\n", |
| 449 | + " }\n", |
| 450 | + " ]\n", |
| 451 | + " },\n", |
| 452 | + " \"golden_output\": {\n", |
| 453 | + " \"changes\": true,\n", |
| 454 | + " \"new_developer_message\": \"Respond with **only** the single word \\\"yes\\\" or \\\"no\\\".\",\n", |
| 455 | + " \"new_messages\": [\n", |
| 456 | + " {\n", |
| 457 | + " \"role\": \"user\",\n", |
| 458 | + " \"content\": \"Is the sky blue?\"\n", |
| 459 | + " },\n", |
| 460 | + " {\n", |
| 461 | + " \"role\": \"assistant\",\n", |
| 462 | + " \"content\": \"yes\"\n", |
| 463 | + " },\n", |
| 464 | + " {\n", |
| 465 | + " \"role\": \"user\",\n", |
| 466 | + " \"content\": \"Is water wet?\"\n", |
| 467 | + " },\n", |
| 468 | + " {\n", |
| 469 | + " \"role\": \"assistant\",\n", |
| 470 | + " \"content\": \"yes\"\n", |
| 471 | + " }\n", |
| 472 | + " ],\n", |
| 473 | + " \"contradiction_issues\": \"\",\n", |
| 474 | + " \"few_shot_contradiction_issues\": \"Assistant examples include explanations despite instruction not to.\",\n", |
| 475 | + " \"format_issues\": \"\",\n", |
| 476 | + " \"general_improvements\": \"\"\n", |
| 477 | + " }\n", |
| 478 | + " }\n", |
| 479 | + " ]" |
| 480 | + ] |
| 481 | + }, |
| 482 | + { |
| 483 | + "cell_type": "markdown", |
| 484 | + "metadata": {}, |
| 485 | + "source": [ |
| 486 | + "From these 20 hand labelled golden outputs which cover a range of contradiction issues, few shot issues, format issues, no issues, or a combination of issues, we built a python string check grader to verify two things: whether an issue was detected for each golden pair and whether the detected issue matched the expected one. From this signal, we tuned the agent instructions and which model to use to maximize our accuracy across this evaluation. We landed on the 4.1 model as a balance between accuracy, cost, and speed. The specific prompts we used also follow the 4.1 prompting guide. As you can see, we achieve the correct labels on all 20 golden outputs: identifying the right issues and avoiding false positives. " |
| 487 | + ] |
| 488 | + }, |
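| + {
| + "cell_type": "markdown",
| + "metadata": {},
| + "source": [
| + "Below is a minimal, hypothetical sketch of this kind of string-check grading, not the exact grader we ran. The helper names (`grade_example`, `keyword_overlap`) are illustrative: for each golden example we check that an issue is flagged only when the golden output expects one, and that the flagged description roughly matches the expected text."
| + ]
| + },
| + {
| + "cell_type": "code",
| + "execution_count": null,
| + "metadata": {},
| + "outputs": [],
| + "source": [
| + "# Illustrative sketch only: a simple string-check grader over the golden examples above.\n",
| + "# `golden` and `predicted` are dicts shaped like the `golden_output` objects shown earlier.\n",
| + "\n",
| + "ISSUE_FIELDS = [\n",
| + "    \"contradiction_issues\",\n",
| + "    \"few_shot_contradiction_issues\",\n",
| + "    \"format_issues\",\n",
| + "]\n",
| + "\n",
| + "def keyword_overlap(expected: str, detected: str) -> bool:\n",
| + "    \"\"\"Loose string check: at least half of the expected keywords appear in the detected text.\"\"\"\n",
| + "    expected_terms = {w.lower().strip(\".,\") for w in expected.split() if len(w) > 3}\n",
| + "    if not expected_terms:\n",
| + "        return True  # nothing meaningful to compare against\n",
| + "    detected_lower = detected.lower()\n",
| + "    hits = sum(term in detected_lower for term in expected_terms)\n",
| + "    return hits / len(expected_terms) >= 0.5\n",
| + "\n",
| + "def grade_example(golden: dict, predicted: dict) -> bool:\n",
| + "    \"\"\"True only if the workflow output matches the golden labels for every issue field.\"\"\"\n",
| + "    for field in ISSUE_FIELDS:\n",
| + "        expected = golden.get(field, \"\")\n",
| + "        detected = predicted.get(field, \"\")\n",
| + "        # 1) Detection must agree: both empty or both non-empty.\n",
| + "        if bool(expected.strip()) != bool(detected.strip()):\n",
| + "            return False\n",
| + "        # 2) When an issue is expected, the description must roughly match.\n",
| + "        if expected.strip() and not keyword_overlap(expected, detected):\n",
| + "            return False\n",
| + "    return True"
| + ]
| + },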
| 503 | + { |
| 504 | + "cell_type": "markdown", |
| 505 | + "metadata": {}, |
| 506 | + "source": [ |
| 507 | + "## 5. Run Optimization Workflow\n", |
377 | 508 | "\n",
|
378 |
| - "Let's dive into how the optimization system actually works. The core workflow consists of multiple runs of the agents in parallel to efficiently process and optimize prompts." |
| 509 | + "Let's dive into how the optimization system actually works end to end. The core workflow consists of multiple runs of the agents in parallel to efficiently process and optimize prompts." |
379 | 510 | ]
|
380 | 511 | },
|
381 | 512 | {
|
|
484 | 615 | "cell_type": "markdown",
|
485 | 616 | "metadata": {},
|
486 | 617 | "source": [
|
487 |
| - "## 5. Examples\n", |
| 618 | + "## 6. Examples\n", |
488 | 619 | "\n",
|
489 | 620 | "Let's see the optimization system in action with some practical examples."
|
490 | 621 | ]
|
|