|
698 | 698 | "Consistent testing and refinement ensure your prompts consistently achieve their intended results."
|
699 | 699 | ]
|
700 | 700 | },
|
| 701 | + { |
| 702 | + "cell_type": "markdown", |
| 703 | + "id": "cac0dc7f", |
| 704 | + "metadata": {}, |
| 705 | + "source": [ |
| 706 | + "### Current Example\n", |
| 707 | + "\n", |
| 708 | + "Let’s evaluate whether our current working prompt has improved as a result of the prompt migration. The original prompt, drawn from this [paper](https://arxiv.org/pdf/2306.05685), is designed to act as a judge that compares two assistants’ answers. Conveniently, the paper provides a set of human-annotated judgments and evaluates the LLM judge by its agreement with these human ground truths.\n", |
| 709 | + "\n", |
| 710 | + "Our goal here is to measure how closely the judgments generated by our migrated prompt align with human evaluations. For context, the benchmark we’re using is a subset of MT-Bench, which features multi-turn conversations. In this example, we’re evaluating 200 conversation rows, each comparing the performance of different model pairs.\n", |
| 711 | + "\n" |
| 712 | + ] |
| 713 | + }, |
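| | + { |
| | + "cell_type": "markdown", |
| | + "id": "judge-agreement-note", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "To make the agreement numbers below concrete, here is a minimal sketch of how such an agreement rate can be computed, assuming each row yields a judge verdict and a human label of 'A', 'B', or 'tie'. The variable names and label scheme are illustrative assumptions, not the exact data format used in this notebook.\n" |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "judge-agreement-sketch", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Illustrative sketch (assumed data shapes, not the notebook's actual variables):\n", |
| | + "# each verdict is 'A', 'B', or 'tie'; ties are excluded, mirroring the\n", |
| | + "# 'without ties' agreement numbers reported in this section.\n", |
| | + "def agreement_rate(judge_verdicts, human_labels):\n", |
| | + "    pairs = [\n", |
| | + "        (j, h)\n", |
| | + "        for j, h in zip(judge_verdicts, human_labels)\n", |
| | + "        if j != 'tie' and h != 'tie'\n", |
| | + "    ]\n", |
| | + "    if not pairs:\n", |
| | + "        return 0.0\n", |
| | + "    return sum(j == h for j, h in pairs) / len(pairs)\n", |
| | + "\n", |
| | + "# Toy example (not MT-Bench data): 3 of 4 non-tie rows match -> 75%.\n", |
| | + "judge = ['A', 'B', 'A', 'tie', 'B']\n", |
| | + "human = ['A', 'B', 'B', 'A', 'B']\n", |
| | + "print(f'Agreement (excluding ties): {agreement_rate(judge, human):.0%}')\n" |
| | + ] |
| | + }, |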
| 714 | + { |
| 715 | + "cell_type": "markdown", |
| 716 | + "id": "6f50f9a0", |
| 717 | + "metadata": {}, |
| 718 | + "source": [ |
| 719 | + "On our evaluation subset, a useful reference point is human-human agreement, since each conversation is rated by multiple annotators. For turn 1 (without ties), humans agree with each other in 81% of cases, and for turn 2, in 76% of cases." |
| 720 | + ] |
| 721 | + }, |
| 722 | + { |
| 723 | + "cell_type": "markdown", |
| 724 | + "id": "20e4d610", |
| 725 | + "metadata": {}, |
| 726 | + "source": [ |
| 727 | + "" |
| 728 | + ] |
| 729 | + }, |
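| | + { |
| | + "cell_type": "markdown", |
| | + "id": "human-agreement-note", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "The human-human baseline above can be computed in a similar spirit: compare every pair of annotators on each conversation and skip any pair involving a tie. The data shape below (one list of annotator verdicts per conversation) is an assumption for illustration, not the benchmark's exact format.\n" |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "human-agreement-sketch", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "from itertools import combinations\n", |
| | + "\n", |
| | + "# Illustrative sketch: pairwise agreement between human annotators,\n", |
| | + "# skipping any annotator pair in which either chose 'tie'.\n", |
| | + "def pairwise_human_agreement(annotations):\n", |
| | + "    agree = total = 0\n", |
| | + "    for labels in annotations:  # one list of annotator verdicts per conversation\n", |
| | + "        for a, b in combinations(labels, 2):\n", |
| | + "            if a == 'tie' or b == 'tie':\n", |
| | + "                continue\n", |
| | + "            total += 1\n", |
| | + "            agree += (a == b)\n", |
| | + "    return agree / total if total else 0.0\n", |
| | + "\n", |
| | + "# Toy example (not the MT-Bench annotations): 4 of 6 annotator pairs agree.\n", |
| | + "toy = [['A', 'A', 'B'], ['B', 'B', 'B']]\n", |
| | + "print(f'Human-human agreement: {pairwise_human_agreement(toy):.0%}')\n" |
| | + ] |
| | + }, |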
| 730 | + { |
| 731 | + "cell_type": "markdown", |
| 732 | + "id": "a3eccb6c", |
| 733 | + "metadata": {}, |
| 734 | + "source": [ |
| 735 | + "For comparison, before migration, GPT-4 (as used in the paper) agrees with the human judgments in 74% of cases on turn 1 and 71% on turn 2: a solid result, but still below the human-human upper bound." |
| 736 | + ] |
| 737 | + }, |
| 738 | + { |
| 739 | + "cell_type": "markdown", |
| 740 | + "id": "91dc3d38", |
| 741 | + "metadata": {}, |
| 742 | + "source": [ |
| 743 | + "" |
| 744 | + ] |
| 745 | + }, |
| 746 | + { |
| 747 | + "cell_type": "markdown", |
| 748 | + "id": "9f7e206f", |
| 749 | + "metadata": {}, |
| 750 | + "source": [ |
| 751 | + "\n", |
| 752 | + "Switching to GPT-4.1 (using the same prompt) improves the agreement: 78% (65/83) on turn 1 and 72% (61/85) on turn 2." |
| 753 | + ] |
| 754 | + }, |
| 755 | + { |
| 756 | + "cell_type": "markdown", |
| 757 | + "id": "800da674", |
| 758 | + "metadata": {}, |
| 759 | + "source": [ |
| 760 | + "\n", |
| 761 | + "Finally, after migrating and tuning the prompt specifically for GPT-4.1, agreement climbs further to 80% on turn 1 and 72% on turn 2, close to the level of agreement seen between the human annotators themselves." |
| 762 | + ] |
| 763 | + }, |
| 764 | + { |
| 765 | + "cell_type": "markdown", |
| 766 | + "id": "7af0337b", |
| 767 | + "metadata": {}, |
| 768 | + "source": [ |
| 769 | + "" |
| 770 | + ] |
| 771 | + }, |
| 772 | + { |
| 773 | + "cell_type": "markdown", |
| 774 | + "id": "43ae2ba5", |
| 775 | + "metadata": {}, |
| 776 | + "source": [ |
| 777 | + "Taken together, these results show that the model upgrade and the migrated prompt both improve agreement with human judgments on our sample task. Go ahead and try it on yours!" |
| 778 | + ] |
| 779 | + }, |
701 | 780 | {
|
702 | 781 | "cell_type": "markdown",
|
703 | 782 | "id": "c3ed1776",
|
|
883 | 962 | "name": "python",
|
884 | 963 | "nbconvert_exporter": "python",
|
885 | 964 | "pygments_lexer": "ipython3",
|
886 |
| - "version": "3.11.8" |
| 965 | + "version": "3.12.9" |
887 | 966 | }
|
888 | 967 | },
|
889 | 968 | "nbformat": 4,
|
|