
Commit 8d4e27a

condensing for cleanness
1 parent 96681b6 commit 8d4e27a

File tree: 1 file changed (+7 -41 lines changed)


examples/Prompt_migration_guide.ipynb

Lines changed: 7 additions & 41 deletions
@@ -705,9 +705,9 @@
 "source": [
 "### Current Example\n",
 "\n",
-"Let’s evaluate whether our current working prompt has improved as a result of prompt migration. The original prompt, drawn from this [paper](https://arxiv.org/pdf/2306.05685), is designed to serve as a judge between two assistants’ answers. Conveniently, the paper provides a set of human-annotated judgments and assesses the LLM judge based on its agreement with these human ground truths.\n",
+"Let’s evaluate whether our prompt migration has actually improved performance on this judging task. The original prompt, drawn from this [paper](https://arxiv.org/pdf/2306.05685), is designed to serve as a judge between two assistants’ answers. Conveniently, the paper provides a set of human-annotated ground truths, so we can measure how often the LLM judge agrees with the human judgments.\n",
 "\n",
-"Our goal here is to measure how closely the judgments generated by our migrated prompt align with human evaluations. For context, the benchmark we’re using is a subset of MT-Bench, which features multi-turn conversations. In this example, we’re evaluating 200 conversation rows, each comparing the performance of different model pairs.\n",
+"Thus, our metric of success will be how closely the judgments generated by our migrated prompt align with human evaluations, compared to the judgments generated with our baseline prompt. For context, the benchmark we’re using is a subset of MT-Bench, which features multi-turn conversations. In this example, we’re evaluating 200 conversation rows, each comparing the performance of different model pairs.\n",
 "\n"
 ]
 },
@@ -716,65 +716,31 @@
 "id": "6f50f9a0",
 "metadata": {},
 "source": [
-"On our evaluation subset, a useful reference point is human-human agreement, since each conversation is rated by multiple annotators. For turn 1 (without ties), humans agree with each other in 81% of cases, and for turn 2, in 76% of cases."
+"On our evaluation subset, a useful reference point is human-human agreement, since each conversation is rated by multiple annotators. Humans do not always agree with each other on which assistant answer is better, so we wouldn't expect our judge to achieve perfect agreement either. For turn 1 (without ties), humans agree with each other in 81% of cases, and for turn 2, in 76% of cases."
 ]
 },
 {
 "cell_type": "markdown",
-"id": "20e4d610",
-"metadata": {},
-"source": [
-"![Graph 1 for Model Agreement](../images/prompt_migrator_fig1.png)"
-]
-},
-{
-"cell_type": "markdown",
-"id": "a3eccb6c",
-"metadata": {},
-"source": [
-"Comparing this to our models before migration, GPT-4 (as used in the paper) achieves an agreement with human judgments of 74% on turn 1 and 71% on turn 2, which is not bad, but still below the human-human upper bound."
-]
-},
-{
-"cell_type": "markdown",
-"id": "91dc3d38",
-"metadata": {},
-"source": [
-"![Graph 2 for Model Agreement](../images/prompt_migrator_fig2.png)"
-]
-},
-{
-"cell_type": "markdown",
-"id": "9f7e206f",
+"id": "7af0337b",
 "metadata": {},
 "source": [
-"\n",
-"Switching to GPT-4.1 (using the same prompt) improves the agreement: 78% (65/83) on turn 1 and 72% (61/85) on turn 2."
+"![Graph 3 for Model Agreement](../images/prompt_migrator_fig3.png)"
 ]
 },
 {
 "cell_type": "markdown",
 "id": "800da674",
 "metadata": {},
 "source": [
-"\n",
-"Finally, after migrating and tuning our prompt specifically for GPT-4.1, the agreement climbs further, reaching 80% on turn 1 and 72% on turn 2, very close to matching the level of agreement seen between human annotators."
-]
-},
-{
-"cell_type": "markdown",
-"id": "7af0337b",
-"metadata": {},
-"source": [
-"![Graph 3 for Model Agreement](../images/prompt_migrator_fig3.png)"
+"Comparing this to our models before migration, GPT-4 (as used in the paper) achieves 74% agreement with human judgments on turn 1 and 71% on turn 2, which is not bad, but still below the human-human ceiling. Switching to GPT-4.1 (using the same prompt) improves the agreement: 77% on turn 1 and 72% on turn 2. Finally, after migrating and tuning our prompt specifically for GPT-4.1, the agreement climbs further, reaching 80% on turn 1 and 72% on turn 2, very close to matching the level of agreement seen between human annotators."
 ]
 },
 {
 "cell_type": "markdown",
 "id": "43ae2ba5",
 "metadata": {},
 "source": [
-"Viewed all together, we can see that prompt migration and model upgrades improve agreement on our sample task. Go ahead and try it on yours!"
+"Viewed all together, we can see that prompt migration and upgrading to more powerful models improve agreement on our sample task. Because the baseline prompt and the task were on the simpler side, there was less optimization to squeeze out of tuning the GPT-4.1 instructions; on more complex tasks you can expect a larger improvement. Go ahead and try it on your prompt now!"
 ]
 },
 {
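For readers who want to reproduce the agreement metric described in the changed cells, the sketch below shows one way a judge-vs-human agreement rate (excluding ties, reported per turn) might be computed. It is a minimal illustration, not code from the notebook; the field names `turn`, `human_label`, and `judge_label` and the helper `agreement_by_turn` are hypothetical.

```python
from collections import defaultdict

def agreement_by_turn(rows, ignore_ties=True):
    """Return {turn: fraction of rows where the judge label matches the human label}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        # The reported numbers exclude ties, mirroring the "without ties" setting.
        if ignore_ties and "tie" in (row["human_label"], row["judge_label"]):
            continue
        totals[row["turn"]] += 1
        hits[row["turn"]] += int(row["judge_label"] == row["human_label"])
    return {turn: hits[turn] / totals[turn] for turn in totals}

# Tiny illustration: two turn-1 comparisons, the judge matches the human once -> 0.5
rows = [
    {"turn": 1, "human_label": "A", "judge_label": "A"},
    {"turn": 1, "human_label": "B", "judge_label": "A"},
]
print(agreement_by_turn(rows))  # {1: 0.5}
```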
