|
15 | 15 | "source": [
|
16 | 16 | "The GPT-5 Family of models are the smartest models we’ve released to date, representing a step change in the models’ capabilities across the board. GPT-5 is particularly specialized in agentic task performance, coding, and steerability, making it a great fit for everyone from curious users to advanced researchers. \n",
|
17 | 17 | "\n",
|
18 |
| - "GPT-5 will benefit from all the traditional prompting best practices, and to help you construct the best prompt we are introducing a [Prompting Guide for GPT-5](#) explaining how to make the most of its state-of-the-art capabilities. Alongside that, we are introducing a [GPT-5 Specific Prompt Optimizer](#https://platform.openai.com/chat/edit?optimize=true) in our Playground to help users get started on **improving existing prompts** and **migrating prompts** for GPT-5 and other OpenAI models.\n", |
| 18 | + "GPT-5 will benefit from all the traditional prompting best practices, and to help you construct the best prompt we are introducing a [Prompting Guide for GPT-5](https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide) explaining how to make the most of its state-of-the-art capabilities. Alongside that, we are introducing a [GPT-5 Specific Prompt Optimizer](https://platform.openai.com/chat/edit?optimize=true) in our Playground to help users get started on **improving existing prompts** and **migrating prompts** for GPT-5 and other OpenAI models.\n", |
19 | 19 | "\n",
|
20 |
| - "In this cookbook we will show how to get up and running quickly with GPT-5 on your task. We will share measurable improvements on common tasks and walk you through using the Prompt Optimizer to do the same.\n" |
| 20 | + "In this cookbook we will show how to get up and running quickly with GPT-5 on your task. We will share measurable improvements on common tasks and walk you through using the Prompt Optimizer to do the same." |
21 | 21 | ]
|
22 | 22 | },
|
23 | 23 | {
|
|
35 | 35 | "\n",
|
36 | 36 | "Along with tuning the prompt for the target model, the Optimizer is cognizant of the specific task you are trying to accomplish and can apply crucial practices to boost performance in Agentic Workflows, Coding and Multi-Modality. Let's walk through some before-and-afters to see where prompt optimization shines. \n",
|
37 | 37 | "\n",
|
38 |
| - "> [!NOTE]\n", |
39 | 38 | "> Remember that prompting is not a one-size-fits-all experience, so we recommend running thorough experiments and iterating to find the best solution for your problem."
|
40 | 39 | ]
|
41 | 40 | },
|
|
44 | 43 | "id": "8fcbc964",
|
45 | 44 | "metadata": {},
|
46 | 45 | "source": [
|
47 |
| - "> [!IMPORTANT]\n", |
| 46 | + "\n", |
48 | 47 | "> Ensure you have set up your OpenAI API Key set as `OPENAI_API_KEY` and have access to GPT-5\n"
|
49 | 48 | ]
|
50 | 49 | },
|
51 | 50 | {
|
52 | 51 | "cell_type": "code",
|
53 |
| - "execution_count": null, |
| 52 | + "execution_count": 1, |
54 | 53 | "id": "5a0d077c",
|
55 | 54 | "metadata": {},
|
56 |
| - "outputs": [], |
| 55 | + "outputs": [ |
| 56 | + { |
| 57 | + "name": "stdout", |
| 58 | + "output_type": "stream", |
| 59 | + "text": [ |
| 60 | + "OPENAI_API_KEY is set!\n" |
| 61 | + ] |
| 62 | + } |
| 63 | + ], |
57 | 64 | "source": [
|
58 | 65 | "import os\n",
|
59 | 66 | "\n",
|
|
111 | 118 | },
|
112 | 119 | {
|
113 | 120 | "cell_type": "code",
|
114 |
| - "execution_count": null, |
| 121 | + "execution_count": 4, |
115 | 122 | "id": "377cc6f4",
|
116 | 123 | "metadata": {},
|
117 | 124 | "outputs": [],
|
|
130 | 137 | "\"\"\"\n"
|
131 | 138 | ]
|
132 | 139 | },
|
133 |
| - { |
134 |
| - "cell_type": "markdown", |
135 |
| - "id": "66ae7a26", |
136 |
| - "metadata": {}, |
137 |
| - "source": [] |
138 |
| - }, |
139 | 140 | {
|
140 | 141 | "cell_type": "markdown",
|
141 | 142 | "id": "01b0e8b3",
|
|
257 | 258 | "\n",
|
258 | 259 | "From there press the **Optimize** button. This will open the optimization panel. At this stage, you can either provide specific edits you'd like to see reflected in the prompt or simply press **Optimize** to have it refined according to best practices for the target model and task. To start let's do just this.\n",
|
259 | 260 | "\n",
|
260 |
| - "\n", |
| 261 | + "\n", |
261 | 262 | "\n",
|
262 | 263 | "\n",
|
263 | 264 | "\n",
|
|
269 | 270 | "\n",
|
270 | 271 | "This is easy using the iterative process of the Prompt Optimizer.\n",
|
271 | 272 | "\n",
|
272 |
| - "\n", |
| 273 | + "\n", |
273 | 274 | "\n"
|
274 | 275 | ]
|
275 | 276 | },
|
|
280 | 281 | "source": [
|
281 | 282 | "Once we are happy with the optimized version of our prompt, we can save it as a [Prompt Object](#https://platform.openai.com/docs/guides/prompt-engineering#reusable-prompts) using a button on the top right of the optimizer. We can use this object within our API Calls which can help with future iteration, version management, and reusability across different applications. \n",
|
282 | 283 | "\n",
|
283 |
| - "\n" |
| 284 | + "\n" |
284 | 285 | ]
|
285 | 286 | },
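Saved Prompt Objects can be referenced by ID from the Responses API instead of pasting the prompt text into every request. The payload below is a minimal sketch: `pmpt_EXAMPLE_ID`, the version string, and the input text are all placeholders, and the exact field set may vary by SDK version.

```python
# Sketch: referencing a saved Prompt Object in a Responses API call.
# "pmpt_EXAMPLE_ID" is a placeholder -- substitute the ID shown in the
# Playground after saving your optimized prompt.
prompt_ref = {
    "id": "pmpt_EXAMPLE_ID",  # hypothetical Prompt Object ID
    "version": "1",           # pin a version for reproducibility
}

request_payload = {
    "model": "gpt-5",
    "prompt": prompt_ref,
    "input": "Summarize the attached report in three bullet points.",
}
```

With the `openai` Python SDK this payload would be sent as `client.responses.create(**request_payload)`; because the prompt lives server-side, updating the saved object rolls out the change without redeploying client code.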
|
286 | 287 | {
|
|
548 | 549 | },
|
549 | 550 | "metadata": {},
|
550 | 551 | "output_type": "display_data"
|
| 552 | + }, |
| 553 | + { |
| 554 | + "name": "stdout", |
| 555 | + "output_type": "stream", |
| 556 | + "text": [ |
| 557 | + "### Prompt Optimization Results - Coding Tasks\n", |
| 558 | + "\n", |
| 559 | + "| Metric | Baseline | Optimized | Δ (Opt − Base) |\n", |
| 560 | + "|----------------------------|---------:|----------:|---------------:|\n", |
| 561 | + "| Avg Time (s) | 7.906 | 6.977 | -0.929 |\n", |
| 562 | + "| Peak Memory (KB) | 3626.3 | 577.5 | -3048.8 |\n", |
| 563 | + "| Exact (%) | 100.0 | 100.0 | 0.0 |\n", |
| 564 | + "| Sorted (%) | 100.0 | 100.0 | 0.0 |\n", |
| 565 | + "| LLM Adherence (1–5) | 4.40 | 4.90 | +0.50 |\n", |
| 566 | + "| Code Quality (1–5) | 4.73 | 4.90 | +0.16 |\n" |
| 567 | + ] |
551 | 568 | }
|
552 | 569 | ],
|
553 | 570 | "source": [
|
|
573 | 590 | " judge_optimized=Path(\"results_llm_as_judge_optimized\")/\"judgement_summary.csv\",\n",
|
574 | 591 | ")\n",
|
575 | 592 | "\n",
|
576 |
| - "display(Markdown(md))" |
| 593 | + "display(Markdown(md))\n", |
| 594 | + "\n", |
| 595 | + "print(md)" |
577 | 596 | ]
|
578 | 597 | },
|
579 | 598 | {
|
|
603 | 622 | "\n",
|
604 | 623 | "Most production use cases face imperfect queries and noisy context. **FailSafeQA** is an excellent benchmark that deliberately perturbs both the **query** (misspellings, incompleteness, off-domain phrasing) and the **context** (missing, OCR-corrupted, or irrelevant docs) and reports **Robustness**, **Context Grounding**, and **Compliance**—i.e., can the model answer when the signal exists and abstain when it doesn’t.\n",
|
605 | 624 | "\n",
|
606 |
| - "\n", |
| 625 | + "\n", |
607 | 626 | "\n",
|
608 | 627 | "**Links**\n",
|
609 | 628 | "- Paper (arXiv): *Expect the Unexpected: FailSafe Long Context QA for Finance* — https://arxiv.org/abs/2502.06329 \n",
|
|
619 | 638 | "We will run FailSafeQA evaluations via the helper script and compare Baseline vs Optimized prompts side by side."
|
620 | 639 | ]
|
621 | 640 | },
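The exact scoring lives in the FailSafeQA harness; as a rough sketch of the Robustness idea only (a datapoint counts as robust when every perturbed variant of its query still clears the compliance threshold), with hypothetical judge ratings:

```python
from statistics import mean

def robustness(ratings_per_datapoint, threshold=6):
    # One plausible reading, not the official implementation: a datapoint
    # is robust only if every perturbed variant's judge rating clears the
    # compliance threshold (>= 6 in this cookbook).
    return mean(
        1.0 if all(r >= threshold for r in ratings) else 0.0
        for ratings in ratings_per_datapoint
    )

# Hypothetical ratings: one list per datapoint, one rating per perturbation.
ratings = [
    [6, 7, 6],  # robust: every variant passes
    [6, 4, 7],  # not robust: one variant falls below the threshold
]
print(robustness(ratings))  # 0.5
```

Averaging a pass/fail bit per datapoint (rather than per variant) is what makes the metric punishing: a single fragile perturbation zeroes out the whole datapoint.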
|
622 |
| - { |
623 |
| - "cell_type": "code", |
624 |
| - "execution_count": null, |
625 |
| - "id": "c5849f77", |
626 |
| - "metadata": {}, |
627 |
| - "outputs": [], |
628 |
| - "source": [] |
629 |
| - }, |
630 | 641 | {
|
631 | 642 | "cell_type": "code",
|
632 | 643 | "execution_count": 3,
|
|
649 | 660 | "We can use the prompt optimizer once again to construct a new prompt that is more suitable for this use case. Drawing on best practices for long-context question answering, we know that we should remind our answer model to rely on information in the context section and refuse answers to questions if the context is insufficient. By using the Optimize button once without any arguments we get a reasonable structure for the prompt and end up with this as our optimized prompt.\n",
|
650 | 661 | "\n",
|
651 | 662 | "\n",
|
652 |
| - "\n", |
| 663 | + "\n", |
653 | 664 | "\n"
|
654 | 665 | ]
|
655 | 666 | },
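For readers skimming past the screenshot, the wording below is illustrative only: a minimal sketch of the kind of context-grounded instruction block the Optimizer tends to produce, not the actual optimized prompt from this run.

```python
# Illustrative sketch of a context-grounded QA instruction block.
# The exact wording of the real optimized prompt differs.
GROUNDED_QA_PROMPT = """\
You are a financial document QA assistant.

Rules:
- Answer ONLY from the material inside the <context> section.
- If the context does not contain enough information to answer,
  reply exactly: "The context is insufficient to answer this question."
- Quote figures and dates verbatim from the context; do not estimate.
"""
```

The two properties that move the FailSafeQA numbers are both present: an explicit grounding rule (answer only from `<context>`) and an explicit abstention rule with fixed refusal wording, which is what the Compliance judge can reliably detect.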
|
|
834 | 845 | },
|
835 | 846 | {
|
836 | 847 | "cell_type": "code",
|
837 |
| - "execution_count": 1, |
| 848 | + "execution_count": 11, |
838 | 849 | "id": "c20097e6",
|
839 | 850 | "metadata": {},
|
840 | 851 | "outputs": [
|
|
845 | 856 | "\n",
|
846 | 857 | "**Compliance threshold:** ≥ 6\n",
|
847 | 858 | "\n",
|
848 |
| - "| Metric | Baseline | Optimized | Δ (Opt − Base) |\n", |
849 |
| - "|---|---:|---:|---:|\n", |
850 |
| - "| Robustness (avg across datapoints) | 0.320 | 0.540 | +0.220 |\n", |
851 |
| - "| Context Grounding (avg across datapoints) | 0.800 | 0.950 | +0.150 |\n", |
| 859 | + "| Metric | Baseline | Optimized | Δ (Opt − Base) |\n", |
| 860 | + "| ----------------------------------------- | -------- | --------- | -------------- |\n", |
| 861 | + "| Robustness (avg across datapoints) | 0.320 | 0.540 | +0.220 |\n", |
| 862 | + "| Context Grounding (avg across datapoints) | 0.800 | 0.950 | +0.150 |\n", |
852 | 863 | "\n",
|
853 | 864 | "_Source files:_ `results_failsafeqa.csv` · `results_failsafeqa.csv`"
|
854 | 865 | ],
|
|
858 | 869 | },
|
859 | 870 | "metadata": {},
|
860 | 871 | "output_type": "display_data"
|
| 872 | + }, |
| 873 | + { |
| 874 | + "name": "stdout", |
| 875 | + "output_type": "stream", |
| 876 | + "text": [ |
| 877 | + "## FailSafeQA — Summary\n", |
| 878 | + "\n", |
| 879 | + "**Compliance threshold:** ≥ 6\n", |
| 880 | + "\n", |
| 881 | + "| Metric | Baseline | Optimized | Δ (Opt − Base) |\n", |
| 882 | + "| ----------------------------------------- | -------- | --------- | -------------- |\n", |
| 883 | + "| Robustness (avg across datapoints) | 0.320 | 0.540 | +0.220 |\n", |
| 884 | + "| Context Grounding (avg across datapoints) | 0.800 | 0.950 | +0.150 |\n", |
| 885 | + "\n", |
| 886 | + "_Source files:_ `results_failsafeqa.csv` · `results_failsafeqa.csv`\n" |
| 887 | + ] |
861 | 888 | }
|
862 | 889 | ],
|
863 | 890 | "source": [
|
|
872 | 899 | ") -> str:\n",
|
873 | 900 | " d_r = robust_opt - robust_base\n",
|
874 | 901 | " d_g = ground_opt - ground_base\n",
|
| 902 | + "\n", |
| 903 | + " # Data rows\n", |
| 904 | + " rows = [\n", |
| 905 | + " [\"Metric\", \"Baseline\", \"Optimized\", \"Δ (Opt − Base)\"],\n", |
| 906 | + " [\"Robustness (avg across datapoints)\", f\"{robust_base:.3f}\", f\"{robust_opt:.3f}\", f\"{d_r:+.3f}\"],\n", |
| 907 | + " [\"Context Grounding (avg across datapoints)\", f\"{ground_base:.3f}\", f\"{ground_opt:.3f}\", f\"{d_g:+.3f}\"],\n", |
| 908 | + " ]\n", |
| 909 | + "\n", |
| 910 | + " # Calculate column widths for alignment\n", |
| 911 | + " col_widths = [max(len(str(row[i])) for row in rows) for i in range(len(rows[0]))]\n", |
| 912 | + "\n", |
| 913 | + " # Build table lines with padding\n", |
| 914 | + " lines = []\n", |
| 915 | + " for i, row in enumerate(rows):\n", |
| 916 | + " padded = [str(cell).ljust(col_widths[j]) for j, cell in enumerate(row)]\n", |
| 917 | + " lines.append(\"| \" + \" | \".join(padded) + \" |\")\n", |
| 918 | + " if i == 0: # after header\n", |
| 919 | + " sep = [\"-\" * col_widths[j] for j in range(len(row))]\n", |
| 920 | + " lines.append(\"| \" + \" | \".join(sep) + \" |\")\n", |
| 921 | + "\n", |
| 922 | + " table = \"\\n\".join(lines)\n", |
| 923 | + "\n", |
875 | 924 | " return f\"\"\"\n",
|
876 | 925 | "## FailSafeQA — Summary\n",
|
877 | 926 | "\n",
|
878 | 927 | "**Compliance threshold:** ≥ {threshold}\n",
|
879 | 928 | "\n",
|
880 |
| - "| Metric | Baseline | Optimized | Δ (Opt − Base) |\n", |
881 |
| - "|---|---:|---:|---:|\n", |
882 |
| - "| Robustness (avg across datapoints) | {robust_base:.3f} | {robust_opt:.3f} | {d_r:+.3f} |\n", |
883 |
| - "| Context Grounding (avg across datapoints) | {ground_base:.3f} | {ground_opt:.3f} | {d_g:+.3f} |\n", |
| 929 | + "{table}\n", |
884 | 930 | "\n",
|
885 | 931 | "_Source files:_ `{src_base}` · `{src_opt}`\n",
|
886 | 932 | "\"\"\".strip()\n",
|
887 | 933 | "\n",
|
888 |
| - "# Fill in with your reported numbers\n", |
| 934 | + "# Usage\n", |
889 | 935 | "md = build_markdown_summary_from_metrics(\n",
|
890 | 936 | " robust_base=0.320, ground_base=0.800,\n",
|
891 | 937 | " robust_opt=0.540, ground_opt=0.950,\n",
|
|
894 | 940 | " src_opt=\"results_failsafeqa.csv\",\n",
|
895 | 941 | ")\n",
|
896 | 942 | "\n",
|
897 |
| - "display(Markdown(md))" |
| 943 | + "# Notebook pretty\n", |
| 944 | + "display(Markdown(md))\n", |
| 945 | + "\n", |
| 946 | + "print(md)" |
898 | 947 | ]
|
899 | 948 | },
|
900 | 949 | {
|
|
921 | 970 | ],
|
922 | 971 | "metadata": {
|
923 | 972 | "kernelspec": {
|
924 |
| - "display_name": "Python 3", |
| 973 | + "display_name": "Python 3.12 (jupyter)", |
925 | 974 | "language": "python",
|
926 |
| - "name": "python3" |
| 975 | + "name": "jupyter-py312" |
927 | 976 | },
|
928 | 977 | "language_info": {
|
929 | 978 | "codemirror_mode": {
|
|
935 | 984 | "name": "python",
|
936 | 985 | "nbconvert_exporter": "python",
|
937 | 986 | "pygments_lexer": "ipython3",
|
938 |
| - "version": "3.11.13" |
| 987 | + "version": "3.12.10" |
939 | 988 | }
|
940 | 989 | },
|
941 | 990 | "nbformat": 4,
|
|