|
27 | 27 | "source": [
|
28 | 28 | "## Migrating and Optimizing Prompts\n",
|
29 | 29 | "\n",
|
30 |
| - "Crafting effective prompts is a critical skill when working with LLMs. The goal of the Prompt Optimizer is to give your prompt the target model best practices and formatting most effective for our models. The Optimizer also removes common prompting failure modes such as: \n", |
| 30 | + "Crafting effective prompts is a critical skill when working with LLMs. The goal of the Prompt Optimizer is to give your prompt the best practices and formatting most effective for our models. The Optimizer also removes common prompting failure modes such as: \n", |
31 | 31 | "\n",
|
32 | 32 | "• Contradictions in the prompt instructions \n",
|
33 | 33 | "•\tMissing or unclear format specifications \n",
|
|
89 | 89 | "\n",
|
90 | 90 | "### Coding and Analytics: Streaming Top‑K Frequent Words \n",
|
91 | 91 | "\n",
|
92 |
| - "We start with a task in a field that model has seen significant improvements: Coding and Analytics. We will ask the model to generate a Python script that computes the exact Top‑K most frequent tokens from a large text stream using a specific tokenization spec. Tasks like these are sensitive to poor prompting, as they can push the model toward the wrong algorithms and approaches (approximate sketches vs multi‑pass/disk‑backed exact solutions), dramatically changing accuracy and runtime.\n", |
| 92 | + "We start with a task in a field that model has seen significant improvements: Coding and Analytics. We will ask the model to generate a Python script that computes the exact Top‑K most frequent tokens from a large text stream using a specific tokenization spec. Tasks like these are highly sensitive to poor prompting as they can push the model toward the wrong algorithms and approaches (approximate sketches vs multi‑pass/disk‑backed exact solutions), dramatically affecting accuracy and runtime.\n", |
93 | 93 | "\n",
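| + "To make the exact-counting solution family concrete, here is a minimal sketch of one shape a correct script could take. It is a hypothetical illustration using only the standard library; the tokenization regex and tie-breaking rule are assumptions, not the spec used in the evaluation.\n", |
| + "\n", |
| + "```python\n", |
| + "import re\n", |
| + "import sys\n", |
| + "from collections import Counter\n", |
| + "\n", |
| + "# Hypothetical tokenization spec: lowercase runs of letters and digits.\n", |
| + "TOKEN_RE = re.compile(r\"[a-z0-9]+\")\n", |
| + "\n", |
| + "def top_k_exact(lines, k=10):\n", |
| + "    \"\"\"Single-pass exact counts; assumes the vocabulary fits in memory.\"\"\"\n", |
| + "    counts = Counter()\n", |
| + "    for line in lines:\n", |
| + "        counts.update(TOKEN_RE.findall(line.lower()))\n", |
| + "    # Sort by descending count, break ties alphabetically, keep the first k.\n", |
| + "    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]\n", |
| + "\n", |
| + "if __name__ == \"__main__\":\n", |
| + "    for token, freq in top_k_exact(sys.stdin, k=10):\n", |
| + "        print(token, freq)\n", |
| + "```\n", |
| + "\n", |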
|
94 | 94 | "For this task, we will evaluate:\n",
|
95 | 95 | "1. Compilation/Execution success over 30 runs\n",
|
|
106 | 106 | "metadata": {},
|
107 | 107 | "source": [
|
108 | 108 | "### Our Baseline Prompt\n",
|
109 |
| - "For our example, let's look at a typical starting prompt with some minor **contradictions in the prompt**, and **ambigous or underspecified instructions**. Contradictions in instructions often reduce performance and increase latency, especially in reasoning models like GPT-5, and ambigous instructions can cause unwanted behaviours. " |
| 109 | + "For our example, let's look at a typical starting prompt with some minor **contradictions in the prompt**, and **ambiguous or underspecified instructions**. Contradictions in instructions often reduce performance and increase latency, especially in reasoning models like GPT-5, and ambiguous instructions can cause unwanted behaviors. " |
110 | 110 | ]
|
111 | 111 | },
|
112 | 112 | {
|
|
130 | 130 | "\"\"\"\n"
|
131 | 131 | ]
|
132 | 132 | },
|
133 | 139 | {
|
134 | 140 | "cell_type": "markdown",
|
135 | 141 | "id": "01b0e8b3",
|
136 | 142 | "metadata": {},
|
137 | 143 | "source": [
|
138 |
| - "This baseline prompt is something that you could expect from asking ChatGPT write you a prompt, or talking to a friend who is knowledgable about coding but not particularly invested in your specific use case. Our baseline prompt is intentionally shorter and friendlier-but it hides mixed signals that can push the model into inconsistent solution families.\n", |
| 144 | + "This baseline prompt is something that you could expect from asking ChatGPT to write you a prompt, or talking to a friend who is knowledgeable about coding but not particularly invested in your specific use case. Our baseline prompt is intentionally shorter and friendlier, but it hides mixed signals that can push the model into inconsistent solution families.\n", |
139 | 145 | "\n",
|
140 | 146 | "First, we say to prefer the standard library, then immediately allow external packages “if they make things simpler.” That soft permission can nudge the model toward non‑portable dependencies or heavier imports that change performance and even execution success across environments.\n",
|
141 | 147 | "\n",
|
|
249 | 255 | "source": [
|
250 | 256 | "Now let's use the prompt optimization tool in the console to improve our prompt and then review the results. We can start by going to the [OpenAI Optimize Playground](#https://platform.openai.com/chat/edit?optimize=true), and pasting our existing prompt in the Developer Message section.\n",
|
251 | 257 | "\n",
|
252 |
| - "From there press the **Optimize** button. This will open the optimization panel. From here you can optionally provide specifics you want to see reflected in the prompt, or you can just press **Optimize** to optimize the prompt for the target model best practices and task. To start let's just optimize our prompt.\n", |
| 258 | + "From there press the **Optimize** button. This will open the optimization panel. At this stage, you can either provide specific edits you'd like to see reflected in the prompt or simply press **Optimize** to have it refined according to best practices for the target model and task. To start let's do just this.\n", |
253 | 259 | "\n",
|
254 | 260 | "\n",
|
255 | 261 | "\n",
|
|
444 | 450 | "source": [
|
445 | 451 | "### Adding LLM-as-a-Judge Grading \n",
|
446 | 452 | "\n",
|
447 |
| - "Along with more quantitative evaluations we can measure the models performance on more qualitative metrics like code quality, and task adherance. We have created a sample prompt for this called ``llm_as_judge.txt``. " |
| 453 | + "Along with more quantitative evaluations we can measure the models performance on more qualitative metrics like code quality, and task adherence. We have created a sample prompt for this called ``llm_as_judge.txt``. " |
448 | 454 | ]
|
449 | 455 | },
|
450 | 456 | {
|
|
613 | 619 | "We will run FailSafeQA evaluations via the helper script and compare Baseline vs Optimized prompts side by side."
|
614 | 620 | ]
|
615 | 621 | },
|
616 | 630 | {
|
617 | 631 | "cell_type": "code",
|
618 | 632 | "execution_count": 3,
|
|
632 | 646 | "id": "0a817cd8",
|
633 | 647 | "metadata": {},
|
634 | 648 | "source": [
|
635 |
| - "We can use the prompt optimizer once again to construct a new prompt that is more suitable for this use case. Maybe as someone who has read about long context question and answering best practices we know that we should remind our answer model to rely on information in the context section and refuse answers to questions if the context is insufficient. By using the Optimize button once without any arguments we get a reasonable structure for the prompt and end up with this as our optimized prompt.\n", |
| 649 | + "We can use the prompt optimizer once again to construct a new prompt that is more suitable for this use case. Drawing on best practices for long-context question answering, we know that we should remind our answer model to rely on information in the context section and refuse answers to questions if the context is insufficient. By using the Optimize button once without any arguments we get a reasonable structure for the prompt and end up with this as our optimized prompt.\n", |
636 | 650 | "\n",
|
637 | 651 | "\n",
|
638 | 652 | "\n",
|
|
677 | 691 | "id": "2516f981",
|
678 | 692 | "metadata": {},
|
679 | 693 | "source": [
|
680 |
| - "Let's now run our evaluations, for demonstration we will display the results of a single comparision, but you can also run the full evaluation. Note: This will take time." |
| 694 | + "Let's now run our evaluations, for demonstration we will display the results of a single comparison, but you can also run the full evaluation. Note: This will take time." |
681 | 695 | ]
|
682 | 696 | },
|
683 | 697 | {
|
|
888 | 902 | "id": "0a84939c",
|
889 | 903 | "metadata": {},
|
890 | 904 | "source": [
|
891 |
| - "GPT-5-mini crushes this task, so even the baseline prompt gets scores of >= 4 almost all of the time. However if we compare the percent of perfect scores (6/6) for the judge, we see that the optimize prompt has way signficantly more perfect answers when evaluated in the two categories of FailSafeQA answer quality: robustness and context grounding." |
| 905 | + "GPT-5-mini crushes this task, so even the baseline prompt gets scores of >= 4 almost all of the time. However if we compare the percent of perfect scores (6/6) for the judge, we see that the optimize prompt has way significantly more perfect answers when evaluated in the two categories of FailSafeQA answer quality: robustness and context grounding." |
892 | 906 | ]
|
893 | 907 | },
|
894 | 908 | {
|
|