|
629 | 629 | "id": "f3490c8a-5ee8-44cd-ae5e-26a6ca7b4017", |
630 | 630 | "metadata": {}, |
631 | 631 | "source": [ |
632 | | - "### Install docling-sdg\n", |
| 632 | + "#### Install docling-sdg\n", |
633 | 633 | "\n", |
634 | | - "[Docling-sdg](https://github.com/docling-project/docling-sdg) project is used to generate question and answer pairs for seed examples." |
| 634 | + "This notebook uses [Docling-sdg](https://github.com/docling-project/docling-sdg) to generate question and answer pairs for each chunk." |
635 | 635 | ] |
636 | 636 | }, |
637 | 637 | { |
|
646 | 646 | }, |
647 | 647 | { |
648 | 648 | "cell_type": "markdown", |
649 | | - "id": "d65ec755-e3de-40ab-bf3a-23ebb29a705d", |
| 649 | + "id": "1497c44e-7b82-4646-b00f-e910688bfb3d", |
650 | 650 | "metadata": {}, |
651 | 651 | "source": [ |
652 | | - "### Initialize QA generator model & Number of Seed examples\n", |
| 652 | + "### Select the chunks for the seed examples\n", |
653 | 653 | "\n", |
654 | | - "To generate seed examples you need to set: \n", |
| 654 | + "Chunks for seed examples should be diverse in style. You can select them by hand or pick a diverse subset of all the chunks with the [subset selection notebook](https://github.com/instructlab/examples/blob/main/notebooks/instructlab-knowledge/subset-selection.ipynb).\n", |
| 655 | + "\n", |
| 656 | + "If you select chunks by hand, copy them directly from lines in `chunks.jsonl`. Each line is an entry with `chunk`, `file`, and `metadata` fields.\n", |
| 657 | + "\n", |
| 658 | + "The code below randomly selects a preset number of chunks and saves them in a JSONL file for the next step." |
| 659 | + ] |
| 660 | + }, |
| 661 | + { |
| 662 | + "cell_type": "code", |
| 663 | + "execution_count": null, |
| 664 | + "id": "fec2c039-a4bb-46cd-92aa-1b24dafb018e", |
| 665 | + "metadata": {}, |
| 666 | + "outputs": [], |
| 667 | + "source": [ |
| 668 | + "from utils.qna_gen import save_random_chunk_selection\n", |
| 669 | + "\n", |
| 670 | + "NUM_SEED_EXAMPLES = 7\n", |
| 671 | + "\n", |
| 672 | + "for contribution in contributions:\n", |
| 673 | + " chunks_jsonl_path = contribution[\"dir\"] / CHUNKING_DIR / \"chunks.jsonl\"\n", |
| 674 | + " authoring_path = contribution[\"dir\"] / AUTHORING_DIR\n", |
| 675 | + "\n", |
| 676 | + " selected_chunks_jsonl = save_random_chunk_selection(chunks_jsonl_path,\n", |
| 677 | + " authoring_path,\n", |
| 678 | + " NUM_SEED_EXAMPLES)\n", |
| 679 | + " print(f\"selected_chunks.jsonl saved to: {selected_chunks_jsonl}\")" |
| 680 | + ] |
| 681 | + }, |
| 682 | + { |
| 683 | + "cell_type": "markdown", |
| 684 | + "id": "051ae1b9-5351-4051-ab26-733e9d05a791", |
| 685 | + "metadata": {}, |
| 686 | + "source": [ |
| 687 | + "### Generate the questions and answers for each chunk\n", |
| 688 | + "\n", |
| 689 | + "To generate questions and answers you need to set: \n", |
655 | 690 | "1. The OpenAI-compatible endpoint for the model generating question and answer pairs\n", |
656 | 691 | "2. The model's API key\n", |
657 | | - "3. The model's name\n", |
658 | | - "4. The number of seed example you wish to generate for each contribution" |
| 692 | + "3. The model's name" |
659 | 693 | ] |
660 | 694 | }, |
661 | 695 | { |
|
669 | 703 | "\n", |
670 | 704 | "API_KEY = os.getenv(\"MODEL_API_KEY\") or \"<INSERT API KEY HERE>\" # the API access key for your account (cannot be empty)\n", |
671 | 705 | "ENDPOINT_URL = os.getenv(\"MODEL_ENDPOINT_URL\") or \"<INSERT ENDPOINT URL HERE>\" # the URL of your model's API. URL can end in \"/v1\"\n", |
672 | | - "MODEL_NAME = os.getenv(\"MODEL_NAME\") or \"mistralai/Mixtral-8x7B-Instruct-v0.1\" # the name of your model\n", |
673 | | - "NUM_SEED_EXAMPLES = 7" |
| 706 | + "MODEL_NAME = os.getenv(\"MODEL_NAME\") or \"mistralai/Mixtral-8x7B-Instruct-v0.1\" # the name of your model" |
674 | 707 | ] |
675 | 708 | }, |
676 | 709 | { |
|
698 | 731 | }, |
699 | 732 | { |
700 | 733 | "cell_type": "markdown", |
701 | | - "id": "d9d5191d-cbd7-4c4e-b5e1-748ed8684eaf", |
| 734 | + "id": "d54cf5e5-339f-44ec-af46-7a023f94e994", |
702 | 735 | "metadata": {}, |
703 | 736 | "source": [ |
704 | | - "### Run QA Generation" |
| 737 | + "#### Generate questions and answers and create qna.yaml file" |
705 | 738 | ] |
706 | 739 | }, |
707 | 740 | { |
|
714 | 747 | "from utils.qna_gen import generate_seed_examples\n", |
715 | 748 | "\n", |
716 | 749 | "for contribution in contributions:\n", |
717 | | - " chunks_jsonl_path = contribution[\"dir\"] / CHUNKING_DIR / \"chunks.jsonl\"\n", |
718 | 750 | " authoring_path = contribution[\"dir\"] / AUTHORING_DIR\n", |
| 751 | + " selected_chunks_path = authoring_path / \"selected_chunks.jsonl\"\n", |
719 | 752 | "\n", |
720 | 753 | " qna_output_path = generate_seed_examples(contribution[\"name\"],\n", |
721 | | - " chunks_jsonl_path,\n", |
| 754 | + " selected_chunks_path,\n", |
722 | 755 | " authoring_path,\n", |
723 | | - " contribution[\"domain\"],\n", |
724 | | - " contribution[\"summary\"],\n", |
725 | | - " NUM_SEED_EXAMPLES,\n", |
726 | 756 | " API_KEY,\n", |
727 | 757 | " ENDPOINT_URL,\n", |
728 | 758 | " MODEL_NAME,\n", |
| 759 | + " contribution[\"domain\"],\n", |
| 760 | + " contribution[\"summary\"],\n", |
729 | 761 | " customization_str)\n", |
730 | 762 | " print(f\"qna.yaml saved to: {qna_output_path}\")\n" |
731 | 763 | ] |
|
737 | 769 | "source": [ |
738 | 770 | "### Review and Revise Seed Examples\n", |
739 | 771 | "\n", |
740 | | - "A quality set of seed examples has diverse contexts and question-and-answer pairs across every seed example. You can assess the `qna.yaml` files in your preferred text editor to ensure the quality, diversity, and style of generated questions and answers, and modify them accordingly.\n", |
| 772 | + "A quality set of seed examples has diverse contexts and question-and-answer pairs across every seed example. You can assess the `qna.yaml` files in your preferred text editor to ensure the quality, diversity, and style of generated questions and answers, and modify them accordingly." |
| 773 | + ] |
| 774 | + }, |
| 775 | + { |
| 776 | + "cell_type": "code", |
| 777 | + "execution_count": null, |
| 778 | + "id": "c80d9d9f-4796-4768-890e-5535939fa082", |
| 779 | + "metadata": {}, |
| 780 | + "outputs": [], |
| 781 | + "source": [ |
| 782 | + "from utils.qna_gen import view_seed_example\n", |
| 783 | + "\n", |
| 784 | + "index = 2 # index of seed example to view. Value must be lower than number of seed examples\n", |
741 | 785 | "\n", |
| 786 | + "# pass in path to qna.yaml file and seed example index to view single seed example\n", |
| 787 | + "view_seed_example(qna_output_path, index)" |
| 788 | + ] |
| 789 | + }, |
| 790 | + { |
| 791 | + "cell_type": "markdown", |
| 792 | + "id": "6272356c-5ab1-4864-80f0-bab393b65cc1", |
| 793 | + "metadata": {}, |
| 794 | + "source": [ |
742 | 795 | "After assessment, the `qna.yaml` files can be quickly reviewed to ensure they include the required elements and the correct number of each. It is recommended to have at least 5 seed examples. Each seed example must have 3 question and answer pairs." |
743 | 796 | ] |
744 | 797 | }, |
|
751 | 804 | "source": [ |
752 | 805 | "from utils.qna_gen import review_seed_examples_file\n", |
753 | 806 | "\n", |
754 | | - "\n", |
755 | | - "\n", |
756 | 807 | "for contribution in contributions:\n", |
757 | 808 | " qna_path = contribution[\"dir\"] / AUTHORING_DIR / \"qna.yaml\"\n", |
758 | 809 | " review_seed_examples_file(qna_path, min_seed_examples=5, num_qa_pairs=3)" |
|
862 | 913 | "name": "python", |
863 | 914 | "nbconvert_exporter": "python", |
864 | 915 | "pygments_lexer": "ipython3", |
865 | | - "version": "3.11.13" |
| 916 | + "version": "3.12.10" |
866 | 917 | } |
867 | 918 | }, |
868 | 919 | "nbformat": 4, |
|