
Commit 96681b6: added delta
1 parent 5d9219a

4 files changed: +80 -1 lines changed

examples/Prompt_migration_guide.ipynb

Lines changed: 80 additions & 1 deletion
@@ -698,6 +698,85 @@
     "Consistent testing and refinement ensure your prompts consistently achieve their intended results."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "cac0dc7f",
+   "metadata": {},
+   "source": [
+    "### Current Example\n",
+    "\n",
+    "Let’s evaluate whether our current working prompt has improved as a result of prompt migration. The original prompt, drawn from this [paper](https://arxiv.org/pdf/2306.05685), is designed to act as a judge between two assistants’ answers. Conveniently, the paper provides a set of human-annotated judgments and assesses the LLM judge by its agreement with these human ground truths.\n",
+    "\n",
+    "Our goal here is to measure how closely the judgments generated by our migrated prompt align with human evaluations. For context, the benchmark we’re using is a subset of MT-Bench, which features multi-turn conversations. In this example, we’re evaluating 200 conversation rows, each comparing the answers of a different model pair.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f50f9a0",
+   "metadata": {},
+   "source": [
+    "On our evaluation subset, a useful reference point is human-human agreement, since each conversation is rated by multiple annotators. For turn 1 (without ties), humans agree with each other in 81% of cases; for turn 2, in 76% of cases."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "20e4d610",
+   "metadata": {},
+   "source": [
+    "![Graph 1 for Model Agreement](../images/prompt_migrator_fig1.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a3eccb6c",
+   "metadata": {},
+   "source": [
+    "Compared with this human baseline, GPT-4 (as used in the paper) achieves 74% agreement with human judgments on turn 1 and 71% on turn 2, which is respectable but still below the human-human upper bound."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "91dc3d38",
+   "metadata": {},
+   "source": [
+    "![Graph 2 for Model Agreement](../images/prompt_migrator_fig2.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9f7e206f",
+   "metadata": {},
+   "source": [
+    "\n",
+    "Switching to GPT-4.1 (using the same prompt) improves the agreement: 78% (65/83) on turn 1 and 72% (61/85) on turn 2."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "800da674",
+   "metadata": {},
+   "source": [
+    "\n",
+    "Finally, after migrating and tuning the prompt specifically for GPT-4.1, agreement climbs further, reaching 80% on turn 1 and 72% on turn 2, close to the level of agreement seen between human annotators."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7af0337b",
+   "metadata": {},
+   "source": [
+    "![Graph 3 for Model Agreement](../images/prompt_migrator_fig3.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "43ae2ba5",
+   "metadata": {},
+   "source": [
+    "Taken together, these results show that prompt migration and model upgrades improve agreement on our sample task. Go ahead and try it on yours!"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "c3ed1776",
@@ -883,7 +962,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.8"
+   "version": "3.12.9"
   }
  },
  "nbformat": 4,

images/prompt_migrator_fig1.png (25.9 KB)

images/prompt_migrator_fig2.png (26.3 KB)

images/prompt_migrator_fig3.png (36.4 KB)
