Commit 1e9e231

committed
Refactor authoring section to include byo chunks for contexts
Enable users to bring their own chunks as contexts for question and answer generation. Add quick review of seed examples after questions and answers are generated. Signed-off-by: Ali Maredia <[email protected]>
1 parent 17a8ed3 commit 1e9e231
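The notebook text added in this commit says hand-picked chunks should be lines taken from `chunks.jsonl`, each with `chunk`, `file`, and `metadata` fields. A minimal sketch of a pre-flight check for a hand-written selection file follows. The helper name `find_invalid_chunk_lines` is hypothetical and not part of `utils/qna_gen.py`; it only assumes the three field names stated in the notebook.

```python
import json
from typing import Iterable

# Field names as described in the notebook text; nothing else is assumed about the schema.
REQUIRED_FIELDS = ("chunk", "file", "metadata")


def find_invalid_chunk_lines(lines: Iterable[str]) -> list[str]:
    """Hypothetical helper: return one message per problematic line (empty list means OK)."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between entries
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {line_number}: not valid JSON ({err.msg})")
            continue
        if not isinstance(entry, dict):
            problems.append(f"line {line_number}: expected a JSON object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            problems.append(f"line {line_number}: missing fields {missing}")
    return problems
```

An open file handle can be passed directly, for example `find_invalid_chunk_lines(open(path, encoding="utf-8"))`, since iterating a text file yields its lines.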

2 files changed

+139
-28
lines changed

notebooks/instructlab-knowledge/instructlab-knowledge.ipynb

Lines changed: 71 additions & 20 deletions
Original file line number | Diff line number | Diff line change
@@ -629,9 +629,9 @@
629629
"id": "f3490c8a-5ee8-44cd-ae5e-26a6ca7b4017",
630630
"metadata": {},
631631
"source": [
632-
"### Install docling-sdg\n",
632+
"#### Install docling-sdg\n",
633633
"\n",
634-
"[Docling-sdg](https://github.com/docling-project/docling-sdg) project is used to generate question and answer pairs for seed examples."
634+
"This notebook uses [Docling-sdg](https://github.com/docling-project/docling-sdg) to generate question and answer pairs for each chunk."
635635
]
636636
},
637637
{
@@ -646,16 +646,50 @@
646646
},
647647
{
648648
"cell_type": "markdown",
649-
"id": "d65ec755-e3de-40ab-bf3a-23ebb29a705d",
649+
"id": "1497c44e-7b82-4646-b00f-e910688bfb3d",
650650
"metadata": {},
651651
"source": [
652-
"### Initialize QA generator model & Number of Seed examples\n",
652+
"### Select the chunks for the seed examples\n",
653653
"\n",
654-
"To generate seed examples you need to set: \n",
654+
"Chunks for seed examples should be diverse in style. Select them by hand, or pick a diverse subset of all chunks with the [subset selection notebook](https://github.com/instructlab/examples/blob/main/notebooks/instructlab-knowledge/subset-selection.ipynb).\n",
655+
"\n",
656+
"If users are selecting chunks by hand, chunks should be taken directly from lines in `chunks.jsonl`. These lines have `chunk`, `file`, and `metadata` fields for each entry.\n",
657+
"\n",
658+
"The code below randomly selects a preset number of chunks and saves them in a jsonl file for the next step."
659+
]
660+
},
661+
{
662+
"cell_type": "code",
663+
"execution_count": null,
664+
"id": "fec2c039-a4bb-46cd-92aa-1b24dafb018e",
665+
"metadata": {},
666+
"outputs": [],
667+
"source": [
668+
"from utils.qna_gen import save_random_chunk_selection\n",
669+
"\n",
670+
"NUM_SEED_EXAMPLES = 7\n",
671+
"\n",
672+
"for contribution in contributions:\n",
673+
" chunks_jsonl_path = contribution[\"dir\"] / CHUNKING_DIR / \"chunks.jsonl\"\n",
674+
" authoring_path = contribution[\"dir\"] / AUTHORING_DIR\n",
675+
"\n",
676+
" selected_chunks_jsonl = save_random_chunk_selection(chunks_jsonl_path,\n",
677+
" authoring_path,\n",
678+
" NUM_SEED_EXAMPLES)\n",
679+
" print(f\"selected_chunks.jsonl saved to: {selected_chunks_jsonl}\")"
680+
]
681+
},
682+
{
683+
"cell_type": "markdown",
684+
"id": "051ae1b9-5351-4051-ab26-733e9d05a791",
685+
"metadata": {},
686+
"source": [
687+
"### Generate the questions and answers for each chunk\n",
688+
"\n",
689+
"To generate questions and answers you need to set: \n",
655690
"1. The OpenAI-compatible endpoint for the model generating question and answer pairs\n",
656691
"2. The model's API key\n",
657-
"3. The model's name\n",
658-
"4. The number of seed example you wish to generate for each contribution"
692+
"3. The model's name"
659693
]
660694
},
661695
{
@@ -669,8 +703,7 @@
669703
"\n",
670704
"API_KEY = os.getenv(\"MODEL_API_KEY\") or \"<INSERT API KEY HERE>\" # the API access key for your account (cannot be empty)\n",
671705
"ENDPOINT_URL = os.getenv(\"MODEL_ENDPOINT_URL\") or \"<INSERT ENDPOINT URL HERE>\" # the URL of your model's API. URL can end in \"/v1\"\n",
672-
"MODEL_NAME = os.getenv(\"MODEL_NAME\") or \"mistralai/Mixtral-8x7B-Instruct-v0.1\" # the name of your model\n",
673-
"NUM_SEED_EXAMPLES = 7"
706+
"MODEL_NAME = os.getenv(\"MODEL_NAME\") or \"mistralai/Mixtral-8x7B-Instruct-v0.1\" # the name of your model"
674707
]
675708
},
676709
{
@@ -698,10 +731,10 @@
698731
},
699732
{
700733
"cell_type": "markdown",
701-
"id": "d9d5191d-cbd7-4c4e-b5e1-748ed8684eaf",
734+
"id": "d54cf5e5-339f-44ec-af46-7a023f94e994",
702735
"metadata": {},
703736
"source": [
704-
"### Run QA Generation"
737+
"#### Generate questions and answers and create qna.yaml file"
705738
]
706739
},
707740
{
@@ -714,18 +747,17 @@
714747
"from utils.qna_gen import generate_seed_examples\n",
715748
"\n",
716749
"for contribution in contributions:\n",
717-
" chunks_jsonl_path = contribution[\"dir\"] / CHUNKING_DIR / \"chunks.jsonl\"\n",
718750
" authoring_path = contribution[\"dir\"] / AUTHORING_DIR\n",
751+
" selected_chunks_path = authoring_path / \"selected_chunks.jsonl\"\n",
719752
"\n",
720753
" qna_output_path = generate_seed_examples(contribution[\"name\"],\n",
721-
" chunks_jsonl_path,\n",
754+
" selected_chunks_path,\n",
722755
" authoring_path,\n",
723-
" contribution[\"domain\"],\n",
724-
" contribution[\"summary\"],\n",
725-
" NUM_SEED_EXAMPLES,\n",
726756
" API_KEY,\n",
727757
" ENDPOINT_URL,\n",
728758
" MODEL_NAME,\n",
759+
" contribution[\"domain\"],\n",
760+
" contribution[\"summary\"], \n",
729761
" customization_str)\n",
730762
" print(f\"qna.yaml saved to: {qna_output_path}\")\n"
731763
]
@@ -737,8 +769,29 @@
737769
"source": [
738770
"### Review and Revise Seed Examples\n",
739771
"\n",
740-
"A quality set of seed examples has diverse contexts and question-and-answer pairs across every seed example. You can assess the `qna.yaml` files in your preferred text editor to ensure the quality, diversity, and style of generated questions and answers, and modify them accordingly.\n",
772+
"A quality set of seed examples has diverse contexts and question-and-answer pairs across every seed example. You can assess the `qna.yaml` files in your preferred text editor to ensure the quality, diversity, and style of generated questions and answers, and modify them accordingly."
773+
]
774+
},
775+
{
776+
"cell_type": "code",
777+
"execution_count": null,
778+
"id": "c80d9d9f-4796-4768-890e-5535939fa082",
779+
"metadata": {},
780+
"outputs": [],
781+
"source": [
782+
"from utils.qna_gen import view_seed_example\n",
783+
"\n",
784+
"index = 2 # index of seed example to view. Value must be lower than number of seed examples\n",
741785
"\n",
786+
"# pass in path to qna.yaml file and seed example index to view single seed example\n",
787+
"view_seed_example(qna_output_path, index)"
788+
]
789+
},
790+
{
791+
"cell_type": "markdown",
792+
"id": "6272356c-5ab1-4864-80f0-bab393b65cc1",
793+
"metadata": {},
794+
"source": [
742795
"After assessment, the `qna.yaml` files can be quickly reviewed to ensure they include the required elements and the correct number of each. It is recommended to have at least 5 seed examples. Each seed example must have 3 question and answer pairs."
743796
]
744797
},
@@ -751,8 +804,6 @@
751804
"source": [
752805
"from utils.qna_gen import review_seed_examples_file\n",
753806
"\n",
754-
"\n",
755-
"\n",
756807
"for contribution in contributions:\n",
757808
" qna_path = contribution[\"dir\"] / AUTHORING_DIR / \"qna.yaml\"\n",
758809
" review_seed_examples_file(qna_path, min_seed_examples=5, num_qa_pairs=3)"
@@ -862,7 +913,7 @@
862913
"name": "python",
863914
"nbconvert_exporter": "python",
864915
"pygments_lexer": "ipython3",
865-
"version": "3.11.13"
916+
"version": "3.12.10"
866917
}
867918
},
868919
"nbformat": 4,

notebooks/instructlab-knowledge/utils/qna_gen.py

Lines changed: 68 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -60,28 +60,56 @@ class IndentedDumper(yaml.Dumper):
6060
def increase_indent(self, flow=False, indentless=False):
6161
return super(IndentedDumper, self).increase_indent(flow, False)
6262

63-
def generate_seed_examples(contribution_name: str, chunks_jsonl_path: Path, output_dir: Path, domain: str, summary: str, num_seed_examples: int, api_key: str, api_url: str, model_id: str, customization_str: str | None = None) -> Path:
63+
def save_random_chunk_selection(chunks_jsonl_path: Path, output_dir: Path, num_seed_examples: int) -> Path:
6464
"""
6565
Randomly selects chunks from chunks.jsonl and saves them to selected_chunks.jsonl
66+
Args:
67+
chunks_jsonl_path (Path): Path to the chunks.jsonl file
68+
output_dir (Path): Path to output dir for selected_chunks.jsonl
69+
num_seed_examples (int): Number of chunks user wishes to randomly select
70+
Returns:
71+
selected_chunks_file_path (pathlib.Path): Path to the saved selected_chunks.jsonl file
72+
"""
73+
if not chunks_jsonl_path.exists():
74+
raise ValueError(f"chunks.jsonl does not exist but should at {chunks_jsonl_path}")
75+
76+
chunks = []
77+
78+
with open(chunks_jsonl_path, 'r') as file:
79+
for line in file:
80+
chunk = json.loads(line)
81+
chunks.append(chunk)
82+
83+
selected_chunks = random.sample(chunks, num_seed_examples)
84+
85+
selected_chunks_file_path = output_dir / "selected_chunks.jsonl"
86+
with open(selected_chunks_file_path, "w", encoding="utf-8") as file:
87+
for chunk in selected_chunks:
88+
json.dump(chunk, file)
89+
file.write("\n")
90+
91+
return selected_chunks_file_path
92+
93+
def generate_seed_examples(contribution_name: str, chunks_jsonl_path: Path, output_dir: Path, api_key: str, api_url: str, model_id: str, domain: str, summary: str, customization_str: str | None = None) -> Path:
94+
"""
95+
Generates questions and answers per chunk via docling sdg. Saves them in an intermediate file
6696
Args:
6797
contribution_name (str): Name of the contribution
6898
chunks_jsonl_path (Path): Path to the chunks/chunks.jsonl file
6999
output_dir (Path): Path to output dir for the qna.yaml and intermediate outputs by docling-sdg
70-
contribution_metadata (dict): Dictionary with the domain and summary of the contribution
71-
num_seed_examples (str): Number of seed examples user wishes to generate for the contribution
72100
api_key (str): API key for the model used to generate questions and answers from contexts
73101
api_url (str): Endpoint for the model used to generate questions and answers from contexts
74102
model_id (str): Name of the model used to generate questions and answers from contexts
75103
customization_str (str | None): A directive for how to stylistically customize the generated QAs
76104
Returns:
77-
qna_output_path (pathlib.Path): Path to the generated seed example file
105+
qna_output_path (pathlib.Path): Path to a json file for generated questions and answers
78106
"""
79107
dataset = {}
80108
dataset[contribution_name] = {}
81109
dataset[contribution_name]["chunks"] = []
82110

83111
if not chunks_jsonl_path.exists():
84-
raise ValueError(f"chunks.jsonl does not exist but should at {chunks_jsonl_path}")
112+
raise ValueError(f"chunks file does not exist but should at {chunks_jsonl_path}")
85113

86114
docs = []
87115

@@ -103,14 +131,13 @@ def generate_seed_examples(contribution_name: str, chunks_jsonl_path: Path, outp
103131
docs.append(doc)
104132

105133
for doc in docs:
106-
print(f"Filtering smaller chunks out of document {doc['file']}")
134+
print(f"Filtering smaller chunks out of chunks from document {doc['file']}")
107135

108136
qa_chunks = get_qa_chunks(doc["file"], doc["chunk_objs"], chunk_filter)
109137
dataset[contribution_name]["chunks"].extend(list(qa_chunks))
110138

111139

112-
l = dataset[contribution_name]["chunks"]
113-
selected_chunks = random.sample(l, num_seed_examples)
140+
selected_chunks = dataset[contribution_name]["chunks"]
114141

115142
generate_options = GenerateOptions(project_id="project_id")
116143
generate_options.provider = LlmProvider.OPENAI_LIKE
@@ -171,6 +198,15 @@ def generate_seed_examples(contribution_name: str, chunks_jsonl_path: Path, outp
171198
return qna_output_path
172199

173200
def review_seed_examples_file(seed_examples_path: Path, min_seed_examples: int = 5, num_qa_pairs: int = 3) -> None:
201+
"""
202+
Check that a seed example file has the expected number of fields
203+
Args:
204+
seed_examples_path (Path): Path to the qna.yaml file
205+
min_seed_examples (int): Minimum number of expected seed examples
206+
num_qa_pairs (int): Number of expected question and answer pairs in a seed example
207+
Returns:
208+
None
209+
"""
174210
with open(seed_examples_path, 'r') as yaml_file:
175211
yaml_data = yaml.safe_load(yaml_file)
176212
errors = []
@@ -215,3 +251,27 @@ def review_seed_examples_file(seed_examples_path: Path, min_seed_examples: int =
215251
else:
216252
print(f"Seed Examples YAML {seed_examples_path.resolve()} is valid :)")
217253
print(f"\n")
254+
255+
256+
257+
def view_seed_example(qna_output_path: Path, seed_example_num: int) -> None:
258+
"""
259+
View a specific seed example in a qna.yaml
260+
Args:
261+
qna_output_path (Path): Path to the qna.yaml file
262+
seed_example_num (int): index of seed example to view
263+
Returns:
264+
None
265+
"""
266+
267+
with open(qna_output_path, "r") as yaml_file:
268+
yaml_data = yaml.safe_load(yaml_file)
269+
seed_examples = yaml_data.get('seed_examples')
270+
if seed_example_num >= len(seed_examples):
271+
raise ValueError(f"seed_example_num must be less than number of seed examples {len(seed_examples)}")
272+
seed_example = seed_examples[seed_example_num]
273+
print("Context:")
274+
print(f"{seed_example['context']}\n")
275+
for qna in seed_example["questions_and_answers"]:
276+
print(f"Question: {qna['question']}")
277+
print(f"Answer: {qna['answer']}\n")
