diff --git a/notebooks/instructlab-knowledge/instructlab-knowledge.ipynb b/notebooks/instructlab-knowledge/instructlab-knowledge.ipynb index 950549c..0a5e53b 100644 --- a/notebooks/instructlab-knowledge/instructlab-knowledge.ipynb +++ b/notebooks/instructlab-knowledge/instructlab-knowledge.ipynb @@ -1,765 +1,878 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "af99f876-0ffd-4079-aeb7-4cead05daaf4", - "metadata": {}, - "source": [ - "# 🐶 Data Pre-Processing: From source PDF to SDG-ready\n", - "\n", - "This notebook outlines the data pre-processing stages for knowledge contributions. A knowledge contribution consists of one or more PDF files that serve as the dataset for fine-tuning a model.\n", - "\n", - "At a high level the steps for the data pre-processing are:\n", - "\n", - "1. [Contribution Overview](#Contribution-Overview)\n", - "1. [Getting Started](#Getting-Started)\n", - "1. [Data Gathering](#Data-Gathering)\n", - "1. [Document Conversion](#Document-Conversion)\n", - "1. [Chunking](#Chunking)\n", - "1. [Authoring](#Authoring)\n", - "1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)\n", - "\n", - "Each step occurs in order and produces outputs used in subsequent steps. The final step creates an SDG dataset that allows users to run the [SDG-Hub knowledge-generation notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/knowledge_tuning/instructlab/knowledge_generation_and_mixing.ipynb) and generate samples.\n", - "\n", - "**NOTE**: Starting the notebook using Python 3.12 is recommended.\n", - "\n", - "\n", - "***" - ] - }, - { - "cell_type": "markdown", - "id": "03227e64-b5d7-4394-af30-530fc5baed2d", - "metadata": {}, - "source": [ - "## Contribution Overview" - ] - }, - { - "cell_type": "markdown", - "id": "1a008179-e734-4476-bfc2-a1e673efde79", - "metadata": { - "jp-MarkdownHeadingCollapsed": true - }, - "source": [ - "### What is a Contribution?\n", - "\n", - "To add knowledge to a model, a user groups source documents of that contain the knowledge into knowledge contributions. A knowledge contribution is made up of:\n", - "\n", - "1. One or more PDF documents that can be described by a contribution summary.\n", - "2. A contribution summary.\n", - "3. A contribution domain.\n", - "4. 
A unique name used to create a directory in the workspace for artifacts created by each step for the contribution.\n", - "\n", - "Once contributions are set up a user can go through the data pre-processing workflow.\n", - "\n", - "### What is a Contribution Summary?\n", - "\n", - "In the synthetic data generation step, a model (known as the teacher model) generates synthetic data based on the source document.\n", - "The contribution summary and domain are used in the prompts that are sent to the teacher model to create data.\n", - "\n", - "The document gets broken up into [chunks](#Chunking), and each chunk is in the prompt sent to the teacher model.\n", - "The contribution summary provides additional context to each chunk of a source document ensuring the teacher model has necessary background information.\n", - "\n", - "Contribution summaries should be specific, avoid acronyms or other vague references, and the represent the documents focus areas.\n", - "When a contribution includes many versions of the same document, publication dates, volume numbers, or any other identifiers to distinguish between versions should be included in the contribution summary.\n", - "\n", - "Here is an example of a contribution summary from a recent paper on [inference-time scaling](https://arxiv.org/pdf/2502.01618):\n", - "\n", - "```\n", - "\"A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)\"\n", - "```\n", - "\n", - "Since the title of the paper does a good job summaraizing the paper, the summary is based off the title but with the acronym LLM spelled out. \n", - "\n", - "Usually contributions only have one document. Contributions with multiple documents happen when the subject matter and format are similar among a group of documents. \n", - "\n", - "An example of a contribution having multiple documents would be the desire to teach a model an organization's bylaws over the years 2021, 2022, 2023, 2024, with a different PDF for each year.\n", - "\n", - "A contribution summary in this case might look like:\n", - "\n", - "`Bylaws of organization Foo from 2021 - 2024`\n", - "\n", - "In the case that there was only one source document from the year 2023, the contribution summary would be:\n", - "\n", - "`2023 Bylaws of organization Foo`\n", - "\n", - "Another example of having multiple documents within the same contribution would be if the documents had the same format. An example here could be grouping together a furniture company's instruction manuals. The format and layout of the instruction manuals would be the same across different pieces of furniture, but each manual covers different furniture.\n", - "\n", - "`Furniture company Foo's assembly instructions for tables, desks, and nightstands`\n", - "\n", - "If the contribution only contained a PDF for the assembly instructions for an oak dining table the summary would be:\n", - "\n", - "`Assembly instructions for furniture company Foo's oak dining table`\n", - "\n", - "### What is a Contribution Domain?\n", - "\n", - "A contribution's domain is the overarching subject or scope of the source document(s). 
The domain provides critical context to guide the teacher model in generating synthetic data that is relevant and grounded.\n", - "\n", - "The domain is brief and should not exceed 3 words, but should ideally be 1-2 words.\n", - "\n", - "To determine the domain, users should review document's primary subject and identify the main topic or purpose of the document.\n", - "Consider the intended use of the document and align it with the use case or audience. E.g. a tech manual for developers might fall under the “software development” domain.\n", - "\n", - "For the contribution summary examples discussed in the previous sections, domains could be `Artificial Intelligence Research`, `Bylaws`, and `Furniture Assembly`.\n", - "\n", - "**Note:** Different contributions can have the same domain" - ] - }, - { - "cell_type": "markdown", - "id": "0b02a66e-125e-47e6-9b6b-5f49d50990ca", - "metadata": {}, - "source": [ - "## Getting Started\n", - "\n", - "The first step in this notebook is to establish a workspace. Workspaces allow multi-tenancy or multiple different runs of this notebook. Without workspaces the results of each of the steps would be overwritten each time this notebook is executed.\n", - "\n", - "Users should change the `WORKSPACE_NAME` to suite their needs.\n", - "\n", - "> **NOTE:**\n", - "> If this notebook is ever run from the middle the following two cells need to be rerun to initialize variables used in every section." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0acd026f-65bd-4393-bb40-f8aa8bd6828b", - "metadata": {}, - "outputs": [], - "source": [ - "from pathlib import Path\n", - "\n", - "WORKSPACE_NAME = \"default\"\n", - "\n", - "WORKSPACE_ROOT = Path(\"workspaces\")\n", - "WORKSPACE_ROOT.mkdir(exist_ok=True)\n", - "\n", - "WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME\n", - "WORKSPACE_DIR.mkdir(exist_ok=True)\n", - "\n", - "SOURCE_DOCUMENT_DIR = \"source_documents\"\n", - "CONVERSION_DIR = \"conversion\"\n", - "CHUNKING_DIR = \"chunking\"\n", - "AUTHORING_DIR = \"authoring\"" - ] - }, - { - "cell_type": "markdown", - "id": "412d5a43-4ec4-43e5-8f08-21aae6c69bfd", - "metadata": {}, - "source": [ - "To create contributions, define the `name` for the contribution, and the `domain` and `summary`. The `name`, `domain` and `summary` go into a dictionary called `knowledge_contribution` which gets added to a list called `contributions`.\n", - "\n", - "Once the list of `contributions` is set, a directory with each contribution name is created within the workspace and subdirectories for `source_documents`, `conversion`, `chunking`, `authoring` are created within the contribution name directory." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8b440e34-817c-4588-9a8c-790f74ec5dbb", - "metadata": {}, - "outputs": [], - "source": [ - "# Populated later on\n", - "contributions = []\n", - "\n", - "# Inference Time Scaling Contribution\n", - "contribution_name = \"inference-time-scaling\"\n", - "contribution_domain = \"Artificial Intelligence Research\" \n", - "contribution_summary = \"A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)\"\n", - "\n", - "# Add contribution information to the knowledge_contribution dictionary for it\n", - "knowledge_contribution = {\"name\": contribution_name, \"domain\": contribution_domain, \"summary\": contribution_summary}\n", - "contributions.append(knowledge_contribution)\n", - "\n", - "# NFL Rules Contribution\n", - "contribution2_name = \"nfl\"\n", - "contribution2_domain = \"sports rules\" \n", - "contribution2_summary = \"Official playing rules of the National Football League 2022, 2023\"\n", - "knowledge_contribution2 = {\"name\": contribution2_name, \"domain\": contribution2_domain, \"summary\": contribution2_summary}\n", - "contributions.append(knowledge_contribution2)\n", - "\n", - "for contribution in contributions:\n", - " contribution_dir = WORKSPACE_DIR / contribution[\"name\"]\n", - " contribution[\"dir\"] = contribution_dir\n", - "\n", - " for subdir in [SOURCE_DOCUMENT_DIR, CONVERSION_DIR, CHUNKING_DIR, AUTHORING_DIR]:\n", - " (contribution_dir / subdir).mkdir(parents=True, exist_ok=True)" - ] - }, - { - "cell_type": "markdown", - "id": "344b7ac5-fc2a-40a8-8e1f-e8dd8b1153e7", - "metadata": {}, - "source": [ - "## Data Gathering\n", - "\n", - "Copy each contribution file to the `WORKSPACE_DIR//source_documents` directory for the following conversion step to detect them." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "26501e2f-7215-441f-9efa-075f87024893", - "metadata": {}, - "outputs": [], - "source": [ - "import shutil\n", - "\n", - "# Inference Time Scaling Contribution\n", - "orig_path = Path(\"sample-pdfs/inference-time-scaling.pdf\")\n", - "dst_path = WORKSPACE_DIR / contribution_name / SOURCE_DOCUMENT_DIR\n", - "\n", - "shutil.copy(orig_path, dst_path)\n", - "\n", - "# NFL Rules Contribution\n", - "rules_2022 = Path(\"sample-pdfs/2022-nfl-rulebook.pdf\")\n", - "rules_2023 = Path(\"sample-pdfs/2023-nfl-rulebook.pdf\")\n", - "rules_dst = WORKSPACE_DIR / contribution2_name / SOURCE_DOCUMENT_DIR\n", - "\n", - "shutil.copy(rules_2022, rules_dst)\n", - "shutil.copy(rules_2023, rules_dst) " - ] - }, - { - "cell_type": "markdown", - "id": "68478061", - "metadata": {}, - "source": [ - "Review this list of files to verify that all expected files are included in each of the contributions." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5325fa71-d09f-457f-9e55-be106dcf78e0", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Files to pre-process\\n--------------------\")\n", - "for contribution in contributions:\n", - " print(f\"\\nContribution: {contribution.get(\"name\")}\")\n", - " print(\"Files:\")\n", - " files = list((contribution['dir'] / SOURCE_DOCUMENT_DIR).glob(\"*.pdf\"))\n", - " for file in files:\n", - " print(file.resolve())" - ] - }, - { - "cell_type": "markdown", - "id": "8a4904e6-8e12-4473-8301-cba90e61bd8b", - "metadata": {}, - "source": [ - "## Document Conversion\n", - "\n", - "This notebook uses [Docling](https://github.com/docling-project/docling) to convert any type of document into a Docling Document: a structured representation of the original document that can be exported as JSON. The resulting JSON output is used in the following step, which performs Docling's chunking methods." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b91d4b2e-19cd-46e7-a912-ba9b2904c7cd", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install -qq docling" - ] - }, - { - "cell_type": "markdown", - "id": "749fb64b-d089-4844-9330-7f3639819e7a", - "metadata": {}, - "source": [ - "### Configure Docling conversion pipeline\n", - "\n", - "Next we set the configuration options for our conversion pipeline. The PDF Conversion options set here are the defaults. More information about pipeline configuration can be found on Docling.\n", - "\n", - "For a complete reference on Docling conversion pipeline configuration, see [PDFPipelineOptions](https://docling-project.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions) and [PDFFormatOptions](https://docling-project.github.io/docling/reference/document_converter/#docling.document_converter.InputFormat.XML_JATS)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "157c5e02-edd1-44f6-b20f-f6b4bda1aae7", - "metadata": {}, - "outputs": [], - "source": [ - "from docling.document_converter import DocumentConverter, PdfFormatOption\n", - "from docling.datamodel.base_models import InputFormat\n", - "from docling.datamodel.pipeline_options import PdfPipelineOptions\n", - "\n", - "pipeline_options = PdfPipelineOptions() # TODO: show the options that can be set\n", - "\n", - "doc_converter = DocumentConverter(\n", - " format_options={\n", - " InputFormat.PDF: PdfFormatOption(\n", - " pipeline_options=pipeline_options\n", - " )\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "73400c74-dead-4998-aee2-ddb00ddaa276", - "metadata": {}, - "source": [ - "Finally, we convert every document into Docling JSON as long as it is a valid file type to be converted" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a200039c-b8b2-4087-88ba-7bfb0e393cc9", - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "\n", - "json_files=[]\n", - "for contribution in contributions:\n", - " files = list((contribution[\"dir\"] / SOURCE_DOCUMENT_DIR).glob(\"*.pdf\"))\n", - " \n", - " for file in files:\n", - " doc = doc_converter.convert(source=file).document\n", - " doc_dict = doc.export_to_dict()\n", - " \n", - " conversion_output_dir = contribution[\"dir\"] / CONVERSION_DIR\n", - " conversion_output_dir.mkdir(parents=True, exist_ok=True)\n", - " \n", - " json_output_path = conversion_output_dir / f\"{file.stem}.json\"\n", - " with open(json_output_path, \"w\") as f:\n", - " json.dump(doc_dict, f)\n", - " print(f\"Path of JSON output is: {Path(json_output_path).resolve()}\")\n", - " json_files.append(json_output_path.resolve())" - ] - }, - { - "cell_type": "markdown", - "id": "40710019-7ec9-414e-ad72-1ba672cf5fc2", - "metadata": {}, - "source": [ - "### Post-Conversion: Illuminator Analysis" - ] - }, - { - "cell_type": "markdown", - "id": "2572e2d0-94dc-4ca0-b032-3978af26c9c9", - "metadata": {}, - "source": [ - "The output of document conversion is not always perfect. Data may become distorted or corrupted, which can negatively affect a model's performance after training. While optional, reviewing your converted data is strongly recommended. The following example explains how to use the Illuminator tool to identify common conversion issues." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "09e07e35-befb-4ed5-9fe4-41544f88d943", - "metadata": {}, - "outputs": [], - "source": [ - "from utils.illuminator.analysis import analyze_docling_tables\n", - "from utils.illuminator.utils import generate_summary\n", - "from docling.datamodel.document import DoclingDocument\n", - "\n", - "import json\n", - "import sys\n", - "from pathlib import Path\n", - "\n", - "for contribution in contributions:\n", - " conversion_dir = contribution[\"dir\"] / CONVERSION_DIR\n", - " converted_json_paths = list(conversion_dir.glob(\"*.json\"))\n", - " results = {}\n", - " \n", - " for path in converted_json_paths:\n", - " with open(path, \"r\") as f:\n", - " doc_dict = json.load(f)\n", - " \n", - " doc = DoclingDocument(**doc_dict)\n", - " results[path] = analyze_docling_tables(doc)\n", - " \n", - " summary_path = contribution[\"dir\"] / CONVERSION_DIR / f\"illuminator-readable-summary-{doc.name}.txt\"\n", - " \n", - " with open(summary_path, \"w\") as f:\n", - " generate_summary(results, file=f)\n", - " \n", - " print(f\"✅ Post-conversion summary saved to: {summary_path.resolve()}\")" - ] - }, - { - "cell_type": "markdown", - "id": "eea0876e-ac55-45fc-93e8-3e646a6c3104", - "metadata": {}, - "source": [ - "\n", - "The output of this post-conversion step should help determine whether to avoid using the content for chunking entirely or to manually edit it before proceeding with chunking.\n" - ] - }, - { - "cell_type": "markdown", - "id": "cafad55e-a4c0-4d6e-9da0-49519fa9bf74", - "metadata": {}, - "source": [ - "## Chunking\n", - "\n", - "The goal of chunking the converted documents is to provide the teacher model small and logical pieces of the source document to generate data off of.\n", - "\n", - "In this notebook we are doing chunking with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).\n", - "\n", - "The input to this notebook is a docling JSON file created after a docling conversion, or a directory of docling JSON files." - ] - }, - { - "cell_type": "markdown", - "id": "2482060c-a49f-4345-aa47-d54301939387", - "metadata": {}, - "source": [ - "### Initialize the Chunker\n", - "\n", - "Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.\n", - "The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document\n", - "\n", - "The `HybridChunker` builds on the `HierarchicalChunker` and by making it tokenization aware.\n", - "\n", - "The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks. Uncomment the commented out code to configure these." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "50df9d91-add4-46a1-a69d-0f7f9f69542e", - "metadata": {}, - "outputs": [], - "source": [ - "#from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n", - "#from transformers import AutoTokenizer\n", - "\n", - "from docling.chunking import HybridChunker\n", - "\n", - "#EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n", - "#MAX_TOKENS = 1024\n", - "#\n", - "# tokenizer = HuggingFaceTokenizer(\n", - "# tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n", - "# max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case\n", - "# merge_peers=True # \n", - "# )\n", - "\n", - "chunker = HybridChunker(\n", - " #tokenizer=tokenizer,\n", - " #merge_peers=True, # whether to merge undersized chunks - defaults to True\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "54ce1d6f-b8d3-470c-b3c9-675911f0ee92", - "metadata": {}, - "source": [ - "### Load and chunk the converted docling document\n", - "\n", - "Next lets convert the document we want to chunk up into a Docling Document.\n", - "\n", - "The resulting chunks are stored in a file called chunks.jsonl in the `chunks` directory in your contribution. This file is used as an input in a later step when creating the seed dataset for SDG." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "db983c05-4aa6-4261-9283-2adab69bfbd3", - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "from docling.document_converter import DocumentConverter\n", - "\n", - "all_chunks = []\n", - "\n", - "for contribution in contributions:\n", - " conversion_dir = contribution[\"dir\"] / CONVERSION_DIR\n", - " json_files = list(conversion_dir.glob(\"*.json\"))\n", - " chunking_output_dir = contribution[\"dir\"] / CHUNKING_DIR\n", - " chunking_output_dir.mkdir(parents=True, exist_ok=True)\n", - " contribution_chunks = []\n", - " \n", - " for file in json_files:\n", - " # reconvert the docling JSON for chunking\n", - " doc = DocumentConverter().convert(source=file)\n", - " \n", - " chunk_iter = chunker.chunk(dl_doc=doc.document)\n", - " chunk_objs = list(chunk_iter)\n", - " \n", - " print(f\"Extracted {len(chunk_objs)} chunks from {doc.document.name}\")\n", - " \n", - " for chunk in chunk_objs:\n", - " c = dict(chunk=chunker.contextualize(chunk=chunk), file=doc.document.name,metadata=chunk.meta.export_json_dict())\n", - " contribution_chunks.append(c)\n", - " all_chunks.append(c)\n", - "\n", - "\n", - " chunks_file_path = chunking_output_dir / \"chunks.jsonl\"\n", - " with open(chunks_file_path, \"w\", encoding=\"utf-8\") as file:\n", - " for chunk in contribution_chunks:\n", - " json.dump(chunk, file)\n", - " file.write(\"\\n\")\n", - " print(f\"Path of chunks JSON is: {Path(chunks_file_path).resolve()}\")" - ] - }, - { - "cell_type": "markdown", - "id": "0fb38545-eb84-4923-8fc4-d10ed08eab26", - "metadata": {}, - "source": [ - "### View the Chunks" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4fdf34c7-9829-43d2-bf9f-7d1d55bb6a4c", - "metadata": {}, - "outputs": [], - "source": [ - "chunk_gen = iter(all_chunks)" - ] - }, - { - "cell_type": "markdown", - "id": "811992ac", - "metadata": {}, - "source": [ - "To view the chunks one by one, rerun the following cell. The document is now broken into small sections with metadata about the chunk based on the document's format." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ee9a8531", - "metadata": {}, - "outputs": [], - "source": [ - "print(next(chunk_gen)['chunk'])" - ] - }, - { - "cell_type": "markdown", - "id": "a510f8c7-8cd3-4867-8742-9f4f9cda9e9f", - "metadata": {}, - "source": [ - "## Authoring\n", - "\n", - "To start the synthetic data generation process, users need to prepare a diverse set of questions and answers based off chunks from each source document. A chunk and question-and-answer pairs are called a seed example." - ] - }, - { - "cell_type": "markdown", - "id": "f3490c8a-5ee8-44cd-ae5e-26a6ca7b4017", - "metadata": {}, - "source": [ - "### Install docling-sdg\n", - "\n", - "[Docling-sdg](https://github.com/docling-project/docling-sdg) project is used to generate question and answer pairs for seed examples." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "86c48e52-cda7-48ac-84dc-0b844aed5f98", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install -qq docling-sdg" - ] - }, - { - "cell_type": "markdown", - "id": "d65ec755-e3de-40ab-bf3a-23ebb29a705d", - "metadata": {}, - "source": [ - "### Initialize QA generator model & Number of Seed examples\n", - "\n", - "To generate seed examples you need to set: \n", - "1. The the Open AI compatible endpoint for the model generating question and answer pairs\n", - "2. The model's API key\n", - "3. The model's name\n", - "4. The number of seed example you wish to generate for each contribution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "874d4de8", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "API_KEY = os.getenv(\"MODEL_API_KEY\") or \"\" # the API access key for your account (cannot be empty)\n", - "ENDPOINT_URL = os.getenv(\"MODEL_ENDPOINT_URL\") or \"\" # the URL of your model's API. URL can end in \"/v1\"\n", - "MODEL_NAME = os.getenv(\"MODEL_NAME\") or \"mistralai/Mixtral-8x7B-Instruct-v0.1\" # the name of your model\n", - "NUM_SEED_EXAMPLES = 7" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b702267e-f550-4bc2-bce4-c0fcecbbd292", - "metadata": {}, - "outputs": [], - "source": [ - "from utils.qna_gen import generate_seed_examples\n", - "\n", - "for contribution in contributions:\n", - " chunks_jsonl_path = contribution[\"dir\"] / CHUNKING_DIR / \"chunks.jsonl\"\n", - " authoring_path = contribution[\"dir\"] / AUTHORING_DIR\n", - "\n", - " qna_output_path = generate_seed_examples(contribution[\"name\"],\n", - " chunks_jsonl_path,\n", - " authoring_path,\n", - " contribution[\"domain\"],\n", - " contribution[\"summary\"],\n", - " NUM_SEED_EXAMPLES,\n", - " API_KEY,\n", - " ENDPOINT_URL,\n", - " MODEL_NAME)\n", - " print(f\"qna.yaml saved to: {qna_output_path}\")\n" - ] - }, - { - "cell_type": "markdown", - "id": "6c574f96-5860-48b9-b4ac-01d367c7717b", - "metadata": {}, - "source": [ - "### Review and Revise Seed Examples\n", - "\n", - "A quality set of seed examples has diverse contexts and question-and-answer pairs across every seed example. You can asses the `qna.yaml` files in your preferred text editor to ensure the quality, diversity, and style of generated questions and answers, and modify them accordingly.\n", - "\n", - "After assessment, the `qna.yaml` files can be quickly reviewed to ensure they includes the required elements and correct number of each. It is recommended to have at least 5 seed examples. Each seed example must have 3 question and answer pairs." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "da3ef131-e5a3-4854-b6e9-3277273a91dd", - "metadata": {}, - "outputs": [], - "source": [ - "from utils.qna_gen import review_seed_examples_file\n", - "\n", - "\n", - "\n", - "for contribution in contributions:\n", - " qna_path = contribution[\"dir\"] / AUTHORING_DIR / \"qna.yaml\"\n", - " review_seed_examples_file(qna_path, min_seed_examples=5, num_qa_pairs=3)" - ] - }, - { - "cell_type": "markdown", - "id": "1f101076-a50f-49ea-a83b-46eaa8b39cc4", - "metadata": {}, - "source": [ - "## Create Seed Dataset for SDG\n", - "\n", - "This step creates the seed data for SDG. This data is a JSON filed that contains a combination of the `seed_examples` in the qna.yaml and the chunks from the source document. \n", - "\n", - "Intermediate seed data files are created for each contribution with the contribution's name included in the file name. For example in the `nfl` contribution, a file containing seed data called `seed_data-nfl.jsonl` would be created in `$WORKSPACE_DIR/nfl`. This file contains a combination of all of the chunks from the NFL source documents and the seed examples in the `qna.yaml` in `$WORKSPACE_DIR/nfl/authoring`.\n", - "\n", - "After seed data files are created for each contribution, a final `seed_data.jsonl` is created in `$WORKSPACE_DIR`. This file is a concatenation of all of the intermediate `seed_data-{contribution name}.jsonl` files and should be used as an input to SDG." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e2c6e31b-e8a9-406c-b2dc-27433c8fd8ec", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install -qq datasets transformers" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ab2c9ed2-8ba8-4959-8e01-81625b81d286", - "metadata": {}, - "outputs": [], - "source": [ - "from utils.create_seed_dataset import get_seed_dataset, safe_concatenate_datasets\n", - "\n", - "contribution_datasets = []\n", - "for contribution in contributions:\n", - " chunks_dir = contribution[\"dir\"] / CHUNKING_DIR\n", - " qna_dir = contribution[\"dir\"] / AUTHORING_DIR\n", - " seed_data = get_seed_dataset(chunks_dir, qna_dir)\n", - " output_path = f'{contribution_dir}/seed_data-{contribution_name}.jsonl'\n", - " seed_data.to_json(output_path, orient='records', lines=True)\n", - " contribution_datasets.append(seed_data)\n", - " print(f\"Intermediate results saved to: {output_path}\")\n", - "\n", - "final_seed_data = safe_concatenate_datasets(contribution_datasets)\n", - "output_path = f'{WORKSPACE_DIR}/seed_data.jsonl'\n", - "final_seed_data.to_json(output_path, orient='records', lines=True)\n", - "\n", - "print(f\"Final seed data contains {final_seed_data.data.num_rows} rows\")\n", - "print(f\"Final seed data for SDG saved to: {output_path}\")" - ] - }, - { - "cell_type": "markdown", - "id": "50ff36f4-19fc-4a27-b51a-3688e7b630e4", - "metadata": {}, - "source": [ - "### Inspect the seed data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a6936825-31c1-4b46-a1af-2fb46f50158d", - "metadata": {}, - "outputs": [], - "source": [ - "print(seed_data.data.table.slice(length=1))" - ] - }, - { - "cell_type": "markdown", - "id": "24a8fcdb-8035-4f30-b856-46afe9f928a1", - "metadata": {}, - "source": [ - "# Summary\n", - "\n", - "To recap, given source documents in PDF format, this notebook:\n", - "\n", - "1. Converts the documents using Docling and saves in the Docling Document format\n", - "2. Splits the extracted text into chunks of JSON\n", - "3. 
Generates Q&A pairs for a subset of those chunks\n", - "4. Creates a `qna.yaml` available for inspection and revision\n", - "5. Combines the chunks and `qna.yaml` to create a `seed_data.jsonl` to use for SDG\n", - "\n", - "The next step is to use the resulting `seed_data.jsonl` for SDG, such as illustrated in [this notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/instructlab/knowledge/knowledge_generation_and_mixing.ipynb)." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "af99f876-0ffd-4079-aeb7-4cead05daaf4", + "metadata": {}, + "source": [ + "# 🐶 Data Pre-Processing: From source PDF to SDG-ready\n", + "\n", + "This notebook outlines the data pre-processing stages for knowledge contributions. A knowledge contribution consists of one or more PDF files that serve as the dataset for fine-tuning a model.\n", + "\n", + "At a high level the steps for the data pre-processing are:\n", + "\n", + "1. [Contribution Overview](#Contribution-Overview)\n", + "1. [Getting Started](#Getting-Started)\n", + "1. [Data Gathering](#Data-Gathering)\n", + "1. [Document Conversion](#Document-Conversion)\n", + "1. [Chunking](#Chunking)\n", + "1. [Authoring](#Authoring)\n", + "1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)\n", + "\n", + "Each step occurs in order and produces outputs used in subsequent steps. The final step creates an SDG dataset that allows users to run the [SDG-Hub knowledge-generation notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/knowledge_tuning/instructlab/knowledge_generation_and_mixing.ipynb) and generate samples.\n", + "\n", + "**NOTE**: Starting the notebook using Python 3.12 is recommended.\n", + "\n", + "\n", + "***" + ] + }, + { + "cell_type": "markdown", + "id": "03227e64-b5d7-4394-af30-530fc5baed2d", + "metadata": {}, + "source": [ + "## Contribution Overview" + ] + }, + { + "cell_type": "markdown", + "id": "1a008179-e734-4476-bfc2-a1e673efde79", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "### What is a Contribution?\n", + "\n", + "To add knowledge to a model, a user groups source documents of that contain the knowledge into knowledge contributions. A knowledge contribution is made up of:\n", + "\n", + "1. One or more PDF documents that can be described by a contribution summary.\n", + "2. A contribution summary.\n", + "3. A contribution domain.\n", + "4. 
A unique name used to create a directory in the workspace for artifacts created by each step for the contribution.\n", + "\n", + "Once contributions are set up, a user can go through the data pre-processing workflow.\n", + "\n", + "### What is a Contribution Summary?\n", + "\n", + "In the synthetic data generation step, a model (known as the teacher model) generates synthetic data based on the source document.\n", + "The contribution summary and domain are used in the prompts that are sent to the teacher model to create data.\n", + "\n", + "The document gets broken up into [chunks](#Chunking), and each chunk is included in the prompt sent to the teacher model.\n", + "The contribution summary provides additional context to each chunk of a source document, ensuring the teacher model has the necessary background information.\n", + "\n", + "Contribution summaries should be specific, avoid acronyms or other vague references, and represent the document's focus areas.\n", + "When a contribution includes many versions of the same document, the contribution summary should include publication dates, volume numbers, or any other identifiers that distinguish between versions.\n", + "\n", + "Here is an example of a contribution summary from a recent paper on [inference-time scaling](https://arxiv.org/pdf/2502.01618):\n", + "\n", + "```\n", + "\"A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)\"\n", + "```\n", + "\n", + "Since the title of the paper does a good job summarizing the paper, the summary is based on the title but with the acronym LLM spelled out.\n", + "\n", + "Usually, a contribution has only one document. Contributions with multiple documents make sense when the subject matter and format are similar across a group of documents.\n", + "\n", + "An example of a contribution with multiple documents would be teaching a model an organization's bylaws for the years 2021, 2022, 2023, and 2024, with a different PDF for each year.\n", + "\n", + "A contribution summary in this case might look like:\n", + "\n", + "`Bylaws of organization Foo from 2021 - 2024`\n", + "\n", + "If there were only one source document, from the year 2023, the contribution summary would be:\n", + "\n", + "`2023 Bylaws of organization Foo`\n", + "\n", + "Another example of having multiple documents within the same contribution would be if the documents had the same format. An example here could be grouping together a furniture company's instruction manuals. The format and layout of the instruction manuals would be the same across the manuals, but each manual covers a different piece of furniture.\n", + "\n", + "`Furniture company Foo's assembly instructions for tables, desks, and nightstands`\n", + "\n", + "If the contribution contained only a PDF of the assembly instructions for an oak dining table, the summary would be:\n", + "\n", + "`Assembly instructions for furniture company Foo's oak dining table`\n", + "\n", + "### What is a Contribution Domain?\n", + "\n", + "A contribution's domain is the overarching subject or scope of the source document(s). The domain provides critical context to guide the teacher model in generating synthetic data that is relevant and grounded.\n", + "\n", + "The domain should be brief: no more than three words, and ideally one or two.\n", + "\n", + "To determine the domain, users should review the document's primary subject and identify the main topic or purpose of the document.\n", + "Consider the intended use of the document and align it with the use case or audience. For example, a tech manual for developers might fall under the “software development” domain.\n", + "\n", + "For the contribution summary examples discussed in the previous sections, domains could be `Artificial Intelligence Research`, `Bylaws`, and `Furniture Assembly`.\n", + "\n", + "**Note:** Different contributions can have the same domain." + ] + }, + { + "cell_type": "markdown", + "id": "0b02a66e-125e-47e6-9b6b-5f49d50990ca", + "metadata": {}, + "source": [ + "## Getting Started\n", + "\n", + "The first step in this notebook is to establish a workspace. Workspaces allow multi-tenancy or multiple different runs of this notebook. Without workspaces, the results of each of the steps would be overwritten each time this notebook is executed.\n", + "\n", + "Users should change the `WORKSPACE_NAME` to suit their needs.\n", + "\n", + "> **NOTE:**\n", + "> If this notebook is ever run from the middle, the following two cells need to be rerun to initialize variables used in every section." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0acd026f-65bd-4393-bb40-f8aa8bd6828b", + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "WORKSPACE_NAME = \"default\"\n", + "\n", + "WORKSPACE_ROOT = Path(\"workspaces\")\n", + "WORKSPACE_ROOT.mkdir(exist_ok=True)\n", + "\n", + "WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME\n", + "WORKSPACE_DIR.mkdir(exist_ok=True)\n", + "\n", + "SOURCE_DOCUMENT_DIR = \"source_documents\"\n", + "CONVERSION_DIR = \"conversion\"\n", + "CHUNKING_DIR = \"chunking\"\n", + "AUTHORING_DIR = \"authoring\"" + ] + }, + { + "cell_type": "markdown", + "id": "412d5a43-4ec4-43e5-8f08-21aae6c69bfd", + "metadata": {}, + "source": [ + "To create a contribution, define its `name`, `domain`, and `summary`. The `name`, `domain`, and `summary` go into a dictionary called `knowledge_contribution`, which gets added to a list called `contributions`.\n", + "\n", + "Once the list of `contributions` is set, a directory with each contribution name is created within the workspace, and subdirectories for `source_documents`, `conversion`, `chunking`, and `authoring` are created within the contribution name directory."
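+ , + "\n", + "As a quick sanity check on the contribution metadata (an empty summary, a domain longer than three words, a duplicate name), you can run a small validation once the next cell has populated `contributions`. The `validate_contributions` helper below is a hypothetical sketch and is not part of the notebook utilities.\n", + "\n", + "```python\n", + "def validate_contributions(contributions):\n", + "    names = set()\n", + "    for c in contributions:\n", + "        assert c['name'] and c['name'] not in names, f\"duplicate or empty name: {c['name']}\"\n", + "        names.add(c['name'])\n", + "        assert c['summary'].strip(), f\"{c['name']}: summary must not be empty\"\n", + "        assert len(c['domain'].split()) <= 3, f\"{c['name']}: keep the domain to 3 words or fewer\"\n", + "\n", + "# validate_contributions(contributions)\n", + "```"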
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b440e34-817c-4588-9a8c-790f74ec5dbb", + "metadata": {}, + "outputs": [], + "source": [ + "# Populated later on\n", + "contributions = []\n", + "\n", + "# Inference Time Scaling Contribution\n", + "contribution_name = \"inference-time-scaling\"\n", + "contribution_domain = \"Artificial Intelligence Research\" \n", + "contribution_summary = \"A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)\"\n", + "\n", + "# Add contribution information to the knowledge_contribution dictionary for it\n", + "knowledge_contribution = {\"name\": contribution_name, \"domain\": contribution_domain, \"summary\": contribution_summary}\n", + "contributions.append(knowledge_contribution)\n", + "\n", + "# NFL Rules Contribution\n", + "contribution2_name = \"nfl\"\n", + "contribution2_domain = \"sports rules\" \n", + "contribution2_summary = \"Official playing rules of the National Football League 2022, 2023\"\n", + "knowledge_contribution2 = {\"name\": contribution2_name, \"domain\": contribution2_domain, \"summary\": contribution2_summary}\n", + "contributions.append(knowledge_contribution2)\n", + "\n", + "for contribution in contributions:\n", + " contribution_dir = WORKSPACE_DIR / contribution[\"name\"]\n", + " contribution[\"dir\"] = contribution_dir\n", + "\n", + " for subdir in [SOURCE_DOCUMENT_DIR, CONVERSION_DIR, CHUNKING_DIR, AUTHORING_DIR]:\n", + " (contribution_dir / subdir).mkdir(parents=True, exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "id": "344b7ac5-fc2a-40a8-8e1f-e8dd8b1153e7", + "metadata": {}, + "source": [ + "## Data Gathering\n", + "\n", + "Copy each contribution file to the `WORKSPACE_DIR//source_documents` directory for the following conversion step to detect them." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "26501e2f-7215-441f-9efa-075f87024893", + "metadata": {}, + "outputs": [], + "source": [ + "import shutil\n", + "\n", + "# Inference Time Scaling Contribution\n", + "orig_path = Path(\"sample-pdfs/inference-time-scaling.pdf\")\n", + "dst_path = WORKSPACE_DIR / contribution_name / SOURCE_DOCUMENT_DIR\n", + "\n", + "shutil.copy(orig_path, dst_path)\n", + "\n", + "# NFL Rules Contribution\n", + "rules_2022 = Path(\"sample-pdfs/2022-nfl-rulebook.pdf\")\n", + "rules_2023 = Path(\"sample-pdfs/2023-nfl-rulebook.pdf\")\n", + "rules_dst = WORKSPACE_DIR / contribution2_name / SOURCE_DOCUMENT_DIR\n", + "\n", + "shutil.copy(rules_2022, rules_dst)\n", + "shutil.copy(rules_2023, rules_dst) " + ] + }, + { + "cell_type": "markdown", + "id": "68478061", + "metadata": {}, + "source": [ + "Review this list of files to verify that all expected files are included in each of the contributions." 
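+ , + "\n", + "The next cell prints the files it finds. If you also want a check that fails loudly when a contribution has no PDFs, a minimal sketch:\n", + "\n", + "```python\n", + "for contribution in contributions:\n", + "    pdfs = list((contribution['dir'] / SOURCE_DOCUMENT_DIR).glob('*.pdf'))\n", + "    assert pdfs, f\"no PDFs found for {contribution['name']}\"\n", + "```"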
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5325fa71-d09f-457f-9e55-be106dcf78e0", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Files to pre-process\\n--------------------\")\n", + "for contribution in contributions:\n", + " print(f\"\\nContribution: {contribution.get('name')}\")\n", + " print(\"Files:\")\n", + " files = list((contribution['dir'] / SOURCE_DOCUMENT_DIR).glob(\"*.pdf\"))\n", + " for file in files:\n", + " print(file.resolve())" + ] + }, + { + "cell_type": "markdown", + "id": "8a4904e6-8e12-4473-8301-cba90e61bd8b", + "metadata": {}, + "source": [ + "## Document Conversion\n", + "\n", + "This notebook uses [Docling](https://github.com/docling-project/docling) to convert any type of document into a Docling Document: a structured representation of the original document that can be exported as JSON. The resulting JSON output is used in the following step, which performs Docling's chunking methods." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b91d4b2e-19cd-46e7-a912-ba9b2904c7cd", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -qq docling" + ] + }, + { + "cell_type": "markdown", + "id": "749fb64b-d089-4844-9330-7f3639819e7a", + "metadata": {}, + "source": [ + "### Configure Docling conversion pipeline\n", + "\n", + "Next, we set the configuration options for our conversion pipeline. The PDF conversion options set here are the defaults. More information about pipeline configuration can be found in the Docling documentation.\n", + "\n", + "For a complete reference on Docling conversion pipeline configuration, see [PdfPipelineOptions](https://docling-project.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions) and [PdfFormatOption](https://docling-project.github.io/docling/reference/document_converter/)."
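+ , + "\n", + "The defaults are used as-is in the next cell. If you need to tune the conversion, a few commonly adjusted `PdfPipelineOptions` fields are sketched below; this is illustrative, based on the Docling reference linked above, so check your installed version for the exact option names:\n", + "\n", + "```python\n", + "from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode\n", + "\n", + "pipeline_options = PdfPipelineOptions()\n", + "pipeline_options.do_ocr = True              # run OCR for scanned pages or bitmap-only text\n", + "pipeline_options.do_table_structure = True  # recover table structure during conversion\n", + "pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # slower, higher-quality tables\n", + "```"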
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "157c5e02-edd1-44f6-b20f-f6b4bda1aae7", + "metadata": {}, + "outputs": [], + "source": [ + "from docling.document_converter import DocumentConverter, PdfFormatOption\n", + "from docling.datamodel.base_models import InputFormat\n", + "from docling.datamodel.pipeline_options import PdfPipelineOptions\n", + "\n", + "pipeline_options = PdfPipelineOptions() # TODO: show the options that can be set\n", + "\n", + "doc_converter = DocumentConverter(\n", + " format_options={\n", + " InputFormat.PDF: PdfFormatOption(\n", + " pipeline_options=pipeline_options\n", + " )\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "73400c74-dead-4998-aee2-ddb00ddaa276", + "metadata": {}, + "source": [ + "Finally, we convert every document into Docling JSON as long as it is a valid file type to be converted" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a200039c-b8b2-4087-88ba-7bfb0e393cc9", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "json_files=[]\n", + "for contribution in contributions:\n", + " files = list((contribution[\"dir\"] / SOURCE_DOCUMENT_DIR).glob(\"*.pdf\"))\n", + " \n", + " for file in files:\n", + " doc = doc_converter.convert(source=file).document\n", + " doc_dict = doc.export_to_dict()\n", + " \n", + " conversion_output_dir = contribution[\"dir\"] / CONVERSION_DIR\n", + " conversion_output_dir.mkdir(parents=True, exist_ok=True)\n", + " \n", + " json_output_path = conversion_output_dir / f\"{file.stem}.json\"\n", + " with open(json_output_path, \"w\") as f:\n", + " json.dump(doc_dict, f)\n", + " print(f\"Path of JSON output is: {Path(json_output_path).resolve()}\")\n", + " json_files.append(json_output_path.resolve())" + ] + }, + { + "cell_type": "markdown", + "id": "40710019-7ec9-414e-ad72-1ba672cf5fc2", + "metadata": {}, + "source": [ + "### Post-Conversion: Illuminator Analysis" + ] + }, + { + "cell_type": "markdown", + "id": "2572e2d0-94dc-4ca0-b032-3978af26c9c9", + "metadata": {}, + "source": [ + "The output of document conversion is not always perfect. Data may become distorted or corrupted, which can negatively affect a model's performance after training. While optional, reviewing your converted data is strongly recommended. The following example explains how to use the Illuminator tool to identify common conversion issues." 
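+ , + "\n", + "In addition to the Illuminator analysis below, which focuses on table extraction issues, you can spot-check a conversion manually by loading the saved JSON back into a `DoclingDocument` and skimming its Markdown export. A minimal sketch, reusing the `json_files` list built during conversion:\n", + "\n", + "```python\n", + "import json\n", + "from docling.datamodel.document import DoclingDocument\n", + "\n", + "with open(json_files[0]) as f:\n", + "    doc = DoclingDocument(**json.load(f))\n", + "\n", + "print(doc.export_to_markdown()[:2000])  # skim the beginning of the converted text\n", + "```"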
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "09e07e35-befb-4ed5-9fe4-41544f88d943", + "metadata": {}, + "outputs": [], + "source": [ + "from utils.illuminator.analysis import analyze_docling_tables\n", + "from utils.illuminator.utils import generate_summary\n", + "from docling.datamodel.document import DoclingDocument\n", + "\n", + "import json\n", + "from pathlib import Path\n", + "\n", + "for contribution in contributions:\n", + " conversion_dir = contribution[\"dir\"] / CONVERSION_DIR\n", + " converted_json_paths = list(conversion_dir.glob(\"*.json\"))\n", + " results = {}\n", + " \n", + " for path in converted_json_paths:\n", + " with open(path, \"r\") as f:\n", + " doc_dict = json.load(f)\n", + " \n", + " doc = DoclingDocument(**doc_dict)\n", + " results[path] = analyze_docling_tables(doc)\n", + " \n", + " summary_path = contribution[\"dir\"] / CONVERSION_DIR / f\"illuminator-readable-summary-{doc.name}.txt\"\n", + " \n", + " with open(summary_path, \"w\") as f:\n", + " generate_summary(results, file=f)\n", + " \n", + " print(f\"✅ Post-conversion summary saved to: {summary_path.resolve()}\")" + ] + }, + { + "cell_type": "markdown", + "id": "eea0876e-ac55-45fc-93e8-3e646a6c3104", + "metadata": {}, + "source": [ + "\n", + "The output of this post-conversion step should help determine whether to avoid using the content for chunking entirely or to manually edit it before proceeding with chunking.\n" + ] + }, + { + "cell_type": "markdown", + "id": "cafad55e-a4c0-4d6e-9da0-49519fa9bf74", + "metadata": {}, + "source": [ + "## Chunking\n", + "\n", + "The goal of chunking the converted documents is to provide the teacher model with small, logical pieces of the source document to generate data from.\n", + "\n", + "In this notebook we are doing chunking with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).\n", + "\n", + "The input to this step is a Docling JSON file created by the conversion above, or a directory of Docling JSON files." + ] + }, + { + "cell_type": "markdown", + "id": "2482060c-a49f-4345-aa47-d54301939387", + "metadata": {}, + "source": [ + "### Initialize the Chunker\n", + "\n", + "Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.\n", + "The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document.\n", + "\n", + "The `HybridChunker` builds on the `HierarchicalChunker` by making it tokenization aware.\n", + "\n", + "The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks. Uncomment the commented-out code to configure these."
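+ , + "\n", + "If you want purely structure-based chunks with no token budget, the `HierarchicalChunker` can be used instead of the `HybridChunker`. A minimal sketch, assuming it is exposed from `docling.chunking` in your installed version:\n", + "\n", + "```python\n", + "from docling.chunking import HierarchicalChunker\n", + "\n", + "chunker = HierarchicalChunker()  # chunks follow the document hierarchy only\n", + "```"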
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "50df9d91-add4-46a1-a69d-0f7f9f69542e", + "metadata": {}, + "outputs": [], + "source": [ + "#from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n", + "#from transformers import AutoTokenizer\n", + "\n", + "from docling.chunking import HybridChunker\n", + "\n", + "#EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n", + "#MAX_TOKENS = 1024\n", + "#\n", + "# tokenizer = HuggingFaceTokenizer(\n", + "# tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n", + "# max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case\n", + "# merge_peers=True # \n", + "# )\n", + "\n", + "chunker = HybridChunker(\n", + " #tokenizer=tokenizer,\n", + " #merge_peers=True, # whether to merge undersized chunks - defaults to True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "54ce1d6f-b8d3-470c-b3c9-675911f0ee92", + "metadata": {}, + "source": [ + "### Load and chunk the converted docling document\n", + "\n", + "Next, let's convert each document we want to chunk into a Docling Document.\n", + "\n", + "The resulting chunks are stored in a file called `chunks.jsonl` in the `chunking` directory of your contribution. This file is used as an input in a later step when creating the seed dataset for SDG." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db983c05-4aa6-4261-9283-2adab69bfbd3", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from docling.document_converter import DocumentConverter\n", + "\n", + "all_chunks = []\n", + "\n", + "for contribution in contributions:\n", + " conversion_dir = contribution[\"dir\"] / CONVERSION_DIR\n", + " json_files = list(conversion_dir.glob(\"*.json\"))\n", + " chunking_output_dir = contribution[\"dir\"] / CHUNKING_DIR\n", + " chunking_output_dir.mkdir(parents=True, exist_ok=True)\n", + " contribution_chunks = []\n", + " \n", + " for file in json_files:\n", + " # reconvert the docling JSON for chunking\n", + " doc = DocumentConverter().convert(source=file)\n", + " \n", + " chunk_iter = chunker.chunk(dl_doc=doc.document)\n", + " chunk_objs = list(chunk_iter)\n", + " \n", + " print(f\"Extracted {len(chunk_objs)} chunks from {doc.document.name}\")\n", + " \n", + " for chunk in chunk_objs:\n", + " c = dict(chunk=chunker.contextualize(chunk=chunk), file=doc.document.name, metadata=chunk.meta.export_json_dict())\n", + " contribution_chunks.append(c)\n", + " all_chunks.append(c)\n", + "\n", + "\n", + " chunks_file_path = chunking_output_dir / \"chunks.jsonl\"\n", + " with open(chunks_file_path, \"w\", encoding=\"utf-8\") as file:\n", + " for chunk in contribution_chunks:\n", + " json.dump(chunk, file)\n", + " file.write(\"\\n\")\n", + " print(f\"Path of chunks JSON is: {Path(chunks_file_path).resolve()}\")" + ] + }, + { + "cell_type": "markdown", + "id": "0fb38545-eb84-4923-8fc4-d10ed08eab26", + "metadata": {}, + "source": [ + "### View the Chunks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4fdf34c7-9829-43d2-bf9f-7d1d55bb6a4c", + "metadata": {}, + "outputs": [], + "source": [ + "chunk_gen = iter(all_chunks)" + ] + }, + { + "cell_type": "markdown", + "id": "811992ac", + "metadata": {}, + "source": [ + "To view the chunks one by one, rerun the following cell. The document is now broken into small sections with metadata about the chunk based on the document's format."
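+ , + "\n", + "Beyond reading individual chunks, it can help to check chunk sizes against the token budget. A small sketch, assuming the `transformers` library is available (it is installed later for the seed-dataset step) and using the embedding model suggested in the commented-out chunker configuration:\n", + "\n", + "```python\n", + "from transformers import AutoTokenizer\n", + "\n", + "tok = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')\n", + "lengths = [len(tok.encode(c['chunk'])) for c in all_chunks]\n", + "print(f\"{len(lengths)} chunks; min/avg/max tokens: {min(lengths)}/{sum(lengths) // len(lengths)}/{max(lengths)}\")\n", + "```"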
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee9a8531", + "metadata": {}, + "outputs": [], + "source": [ + "print(next(chunk_gen)['chunk'])" + ] + }, + { + "cell_type": "markdown", + "id": "a510f8c7-8cd3-4867-8742-9f4f9cda9e9f", + "metadata": {}, + "source": [ + "## Authoring\n", + "\n", + "To start the synthetic data generation process, users need to prepare a diverse set of questions and answers based on chunks from each source document. A chunk and its question-and-answer pairs are called a seed example." + ] + }, + { + "cell_type": "markdown", + "id": "f3490c8a-5ee8-44cd-ae5e-26a6ca7b4017", + "metadata": {}, + "source": [ + "### Install docling-sdg\n", + "\n", + "The [Docling-sdg](https://github.com/docling-project/docling-sdg) project is used to generate question and answer pairs for seed examples." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "86c48e52-cda7-48ac-84dc-0b844aed5f98", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -qq docling-sdg" + ] + }, + { + "cell_type": "markdown", + "id": "d65ec755-e3de-40ab-bf3a-23ebb29a705d", + "metadata": {}, + "source": [ + "### Initialize QA generator model & Number of Seed examples\n", + "\n", + "To generate seed examples, you need to set:\n", + "1. The OpenAI-compatible endpoint for the model generating question and answer pairs\n", + "2. The model's API key\n", + "3. The model's name\n", + "4. The number of chunks to select for authoring for each contribution (configured below under *Configure subset selection*)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "874d4de8", + "metadata": { + "tags": [ + "parameters" + ] + }, + "outputs": [], + "source": [ + "API_KEY = \"none\" # the API access key for your account (cannot be empty)\n", + "API_URL = \"http://127.0.0.1:11434/v1\" # the URL of your model's API\n", + "MODEL_ID = \"granite3.3\" # the name of your model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b702267e-f550-4bc2-bce4-c0fcecbbd292", + "metadata": {}, + "outputs": [], + "source": [ + "from pydantic import SecretStr\n", + "\n", + "# NOTE: these import paths assume the current docling-sdg layout; adjust them if your installed version differs\n", + "from docling_sdg.qa.base import GenerateOptions, LlmProvider\n", + "\n", + "generate_options = GenerateOptions(project_id=\"project_id\")\n", + "generate_options.provider = LlmProvider.OPENAI_LIKE\n", + "generate_options.api_key = SecretStr(API_KEY)\n", + "generate_options.url = API_URL\n", + "generate_options.model_id = MODEL_ID" + ] + }, + { + "cell_type": "markdown", + "id": "32e13a94-1c5e-4310-9500-6940368ec2ea", + "metadata": {}, + "source": [ + "### [OPTIONAL] Prompt customization for Q&A Generation\n", + "\n", + "The cell below modifies the default prompt used by `docling-sdg` for Q&A generation by adding a customization statement.\n", + "\n", + "Insert your own customization statement below and run the rest of the cells in this section if you would like to stylistically customize Q&A generation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78b51f57-7b7b-4d53-a129-29c291939dae", + "metadata": {}, + "outputs": [], + "source": [ + "customization_str = \"Write at the fifth grade level.\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6e8dba7-8798-429b-9f46-806111ce6e6c", + "metadata": {}, + "outputs": [], + "source": [ + "from docling_sdg.qa.prompts.generation_prompts import QaPromptTemplate\n", + "\n", + "\n", + "CUSTOM_COMBINED_QUESTION_PROMPT = (\n", + " \"I will provide you a text passage. I need you to generate three questions that \"\n", + " \"must be answered only with information contained in this passage, and nothing \"\n", + " \"else.\\n\"\n", + " 'The first question is of type \"fact_single\", which means that the answer to this '\n", + " \"question is a simple, single piece of factual information contained in the \"\n", + " \"context.\\n\"\n", + " 'The second question is of type \"summary\", which means that the answer to this '\n", + " \"question summarizes different pieces of factual information contained in the \"\n", + " \"context.\\n\"\n", + " 'The third question is of type \"reasoning\", which is a question that requires the '\n", + " \"reader to think critically and make an inference or draw a conclusion based on \"\n", + " \"the information provided in the passage.\\n\"\n", + " \"Make sure that the three questions are different.\\n\"\n", + " \"\\n\"\n", + " \"You will format your generation as a python dictionary, such as:\\n\"\n", + " '{\"fact_single\": <fact_single_question>, '\n", + " '\"fact_single_answer\": <fact_single_answer>, \"summary\": <summary_question>, \"summary_answer\": <summary_answer>, \"reasoning\": <reasoning_question>, \"reasoning_answer\": <reasoning_answer>}\\n'\n", + " \"\\n\"\n", + " \"Only provide the python dictionary as your output. Make sure you provide an answer for each question.\\n\"\n", + " \"{customization_str}\"\n", + " \"\\n\"\n", + " \"Context: {context_str}\"\n", + ")\n", + "\n", + "custom_combined_question_qa_prompt: QaPromptTemplate = QaPromptTemplate(\n", + " template=CUSTOM_COMBINED_QUESTION_PROMPT,\n", + " keys=[\"context_str\", \"customization_str\"],\n", + " labels=[\"fact_single\", \"summary\", \"reasoning\"],\n", + " type_=\"question\",\n", + ")\n", + "\n", + "generate_options.prompts = [custom_combined_question_qa_prompt]" + ] + }, + { + "cell_type": "markdown", + "id": "919199c0-3747-409a-85ab-0155ef3ebe9d", + "metadata": {}, + "source": [ + "### Configure subset selection" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1197d4e-8354-45e3-9ec9-85c78ba36548", + "metadata": {}, + "outputs": [], + "source": [ + "NUM_CHUNKS_TO_SELECT_FOR_AUTHORING = 5" + ] + }, + { + "cell_type": "markdown", + "id": "d2421d07-3e6c-4355-95f4-da8e157557c7", + "metadata": {}, + "source": [ + "### Run QA generation on selected chunks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e57edff5-9a13-47fb-9248-9140ae5baaca", + "metadata": {}, + "outputs": [], + "source": [ + "from utils.qna_gen import generate_seed_examples\n", + "\n", + "for contribution in contributions:\n", + " chunks_jsonl_path = contribution[\"dir\"] / CHUNKING_DIR / \"chunks.jsonl\"\n", + " authoring_path = contribution[\"dir\"] / AUTHORING_DIR\n", + "\n", + " # pass the chunk budget and endpoint settings configured above\n", + " qna_output_path = generate_seed_examples(contribution[\"name\"],\n", + " chunks_jsonl_path,\n", + " authoring_path,\n", + " contribution[\"domain\"],\n", + " contribution[\"summary\"],\n", + " NUM_CHUNKS_TO_SELECT_FOR_AUTHORING,\n", + " API_KEY,\n", + " API_URL,\n", + " MODEL_ID)\n", + " print(f\"qna.yaml saved to: {qna_output_path}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "6c574f96-5860-48b9-b4ac-01d367c7717b", + "metadata": {}, + "source": [ + "### Review and Revise Seed Examples\n", + "\n", + "A quality set of seed examples has diverse contexts and question-and-answer pairs across every seed example.
You can assess the `qna.yaml` files in your preferred text editor to ensure the quality, diversity, and style of the generated questions and answers, and modify them accordingly.\n", + "\n", + "After assessment, the `qna.yaml` files can be quickly reviewed to ensure they include the required elements and the correct number of each. It is recommended to have at least 5 seed examples. Each seed example must have 3 question-and-answer pairs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "da3ef131-e5a3-4854-b6e9-3277273a91dd", + "metadata": {}, + "outputs": [], + "source": [ + "from utils.qna_gen import review_seed_examples_file\n", + "\n", + "for contribution in contributions:\n", + " qna_path = contribution[\"dir\"] / AUTHORING_DIR / \"qna.yaml\"\n", + " review_seed_examples_file(qna_path, min_seed_examples=5, num_qa_pairs=3)" + ] + }, + { + "cell_type": "markdown", + "id": "1f101076-a50f-49ea-a83b-46eaa8b39cc4", + "metadata": {}, + "source": [ + "## Create Seed Dataset for SDG\n", + "\n", + "This step creates the seed data for SDG. This data is a JSONL file that contains a combination of the `seed_examples` in the qna.yaml and the chunks from the source document.\n", + "\n", + "Intermediate seed data files are created for each contribution with the contribution's name included in the file name. For example, in the `nfl` contribution, a file containing seed data called `seed_data-nfl.jsonl` would be created in `$WORKSPACE_DIR/nfl`. This file contains a combination of all of the chunks from the NFL source documents and the seed examples in the `qna.yaml` in `$WORKSPACE_DIR/nfl/authoring`.\n", + "\n", + "After seed data files are created for each contribution, a final `seed_data.jsonl` is created in `$WORKSPACE_DIR`. This file is a concatenation of all of the intermediate `seed_data-{contribution name}.jsonl` files and should be used as an input to SDG."
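+ , + "\n", + "Once the cells below have produced `seed_data.jsonl`, you can sanity-check it by loading it back with the `datasets` library and confirming the row count and columns. A minimal sketch:\n", + "\n", + "```python\n", + "from datasets import load_dataset\n", + "\n", + "check = load_dataset('json', data_files=str(WORKSPACE_DIR / 'seed_data.jsonl'), split='train')\n", + "print(check)     # column names and row count\n", + "print(check[0])  # first record\n", + "```"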
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e2c6e31b-e8a9-406c-b2dc-27433c8fd8ec", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -qq datasets transformers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab2c9ed2-8ba8-4959-8e01-81625b81d286", + "metadata": {}, + "outputs": [], + "source": [ + "from utils.create_seed_dataset import get_seed_dataset, safe_concatenate_datasets\n", + "\n", + "contribution_datasets = []\n", + "for contribution in contributions:\n", + " chunks_dir = contribution[\"dir\"] / CHUNKING_DIR\n", + " qna_dir = contribution[\"dir\"] / AUTHORING_DIR\n", + " seed_data = get_seed_dataset(chunks_dir, qna_dir)\n", + " # use this contribution's own directory and name, not the globals left over from earlier cells\n", + " output_path = f'{contribution[\"dir\"]}/seed_data-{contribution[\"name\"]}.jsonl'\n", + " seed_data.to_json(output_path, orient='records', lines=True)\n", + " contribution_datasets.append(seed_data)\n", + " print(f\"Intermediate results saved to: {output_path}\")\n", + "\n", + "final_seed_data = safe_concatenate_datasets(contribution_datasets)\n", + "output_path = f'{WORKSPACE_DIR}/seed_data.jsonl'\n", + "final_seed_data.to_json(output_path, orient='records', lines=True)\n", + "\n", + "print(f\"Final seed data contains {final_seed_data.data.num_rows} rows\")\n", + "print(f\"Final seed data for SDG saved to: {output_path}\")" + ] + }, + { + "cell_type": "markdown", + "id": "50ff36f4-19fc-4a27-b51a-3688e7b630e4", + "metadata": {}, + "source": [ + "### Inspect the seed data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a6936825-31c1-4b46-a1af-2fb46f50158d", + "metadata": {}, + "outputs": [], + "source": [ + "print(final_seed_data.data.table.slice(length=1))" + ] + }, + { + "cell_type": "markdown", + "id": "24a8fcdb-8035-4f30-b856-46afe9f928a1", + "metadata": {}, + "source": [ + "# Summary\n", + "\n", + "To recap, given source documents in PDF format, this notebook:\n", + "\n", + "1. Converts the documents using Docling and saves them in the Docling Document format\n", + "2. Splits the extracted text into chunks of JSON\n", + "3. Generates Q&A pairs for a subset of those chunks\n", + "4. Creates a `qna.yaml` available for inspection and revision\n", + "5. Combines the chunks and `qna.yaml` to create a `seed_data.jsonl` to use for SDG\n", + "\n", + "The next step is to use the resulting `seed_data.jsonl` for SDG, as illustrated in [this notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/instructlab/knowledge/knowledge_generation_and_mixing.ipynb)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 }