diff --git a/notebooks/instructlab-knowledge/instructlab-knowledge.ipynb b/notebooks/instructlab-knowledge/instructlab-knowledge.ipynb index 950549c..0a5e53b 100644 --- a/notebooks/instructlab-knowledge/instructlab-knowledge.ipynb +++ b/notebooks/instructlab-knowledge/instructlab-knowledge.ipynb @@ -1,765 +1,878 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "af99f876-0ffd-4079-aeb7-4cead05daaf4", - "metadata": {}, - "source": [ - "# 🐶 Data Pre-Processing: From source PDF to SDG-ready\n", - "\n", - "This notebook outlines the data pre-processing stages for knowledge contributions. A knowledge contribution consists of one or more PDF files that serve as the dataset for fine-tuning a model.\n", - "\n", - "At a high level the steps for the data pre-processing are:\n", - "\n", - "1. [Contribution Overview](#Contribution-Overview)\n", - "1. [Getting Started](#Getting-Started)\n", - "1. [Data Gathering](#Data-Gathering)\n", - "1. [Document Conversion](#Document-Conversion)\n", - "1. [Chunking](#Chunking)\n", - "1. [Authoring](#Authoring)\n", - "1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)\n", - "\n", - "Each step occurs in order and produces outputs used in subsequent steps. The final step creates an SDG dataset that allows users to run the [SDG-Hub knowledge-generation notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/knowledge_tuning/instructlab/knowledge_generation_and_mixing.ipynb) and generate samples.\n", - "\n", - "**NOTE**: Starting the notebook using Python 3.12 is recommended.\n", - "\n", - "\n", - "***" - ] - }, - { - "cell_type": "markdown", - "id": "03227e64-b5d7-4394-af30-530fc5baed2d", - "metadata": {}, - "source": [ - "## Contribution Overview" - ] - }, - { - "cell_type": "markdown", - "id": "1a008179-e734-4476-bfc2-a1e673efde79", - "metadata": { - "jp-MarkdownHeadingCollapsed": true - }, - "source": [ - "### What is a Contribution?\n", - "\n", - "To add knowledge to a model, a user groups source documents of that contain the knowledge into knowledge contributions. A knowledge contribution is made up of:\n", - "\n", - "1. One or more PDF documents that can be described by a contribution summary.\n", - "2. A contribution summary.\n", - "3. A contribution domain.\n", - "4. 
A unique name used to create a directory in the workspace for artifacts created by each step for the contribution.\n", - "\n", - "Once contributions are set up a user can go through the data pre-processing workflow.\n", - "\n", - "### What is a Contribution Summary?\n", - "\n", - "In the synthetic data generation step, a model (known as the teacher model) generates synthetic data based on the source document.\n", - "The contribution summary and domain are used in the prompts that are sent to the teacher model to create data.\n", - "\n", - "The document gets broken up into [chunks](#Chunking), and each chunk is in the prompt sent to the teacher model.\n", - "The contribution summary provides additional context to each chunk of a source document ensuring the teacher model has necessary background information.\n", - "\n", - "Contribution summaries should be specific, avoid acronyms or other vague references, and the represent the documents focus areas.\n", - "When a contribution includes many versions of the same document, publication dates, volume numbers, or any other identifiers to distinguish between versions should be included in the contribution summary.\n", - "\n", - "Here is an example of a contribution summary from a recent paper on [inference-time scaling](https://arxiv.org/pdf/2502.01618):\n", - "\n", - "```\n", - "\"A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)\"\n", - "```\n", - "\n", - "Since the title of the paper does a good job summaraizing the paper, the summary is based off the title but with the acronym LLM spelled out. \n", - "\n", - "Usually contributions only have one document. Contributions with multiple documents happen when the subject matter and format are similar among a group of documents. \n", - "\n", - "An example of a contribution having multiple documents would be the desire to teach a model an organization's bylaws over the years 2021, 2022, 2023, 2024, with a different PDF for each year.\n", - "\n", - "A contribution summary in this case might look like:\n", - "\n", - "`Bylaws of organization Foo from 2021 - 2024`\n", - "\n", - "In the case that there was only one source document from the year 2023, the contribution summary would be:\n", - "\n", - "`2023 Bylaws of organization Foo`\n", - "\n", - "Another example of having multiple documents within the same contribution would be if the documents had the same format. An example here could be grouping together a furniture company's instruction manuals. The format and layout of the instruction manuals would be the same across different pieces of furniture, but each manual covers different furniture.\n", - "\n", - "`Furniture company Foo's assembly instructions for tables, desks, and nightstands`\n", - "\n", - "If the contribution only contained a PDF for the assembly instructions for an oak dining table the summary would be:\n", - "\n", - "`Assembly instructions for furniture company Foo's oak dining table`\n", - "\n", - "### What is a Contribution Domain?\n", - "\n", - "A contribution's domain is the overarching subject or scope of the source document(s). 
The domain provides critical context to guide the teacher model in generating synthetic data that is relevant and grounded.\n", - "\n", - "The domain is brief and should not exceed 3 words, but should ideally be 1-2 words.\n", - "\n", - "To determine the domain, users should review document's primary subject and identify the main topic or purpose of the document.\n", - "Consider the intended use of the document and align it with the use case or audience. E.g. a tech manual for developers might fall under the “software development” domain.\n", - "\n", - "For the contribution summary examples discussed in the previous sections, domains could be `Artificial Intelligence Research`, `Bylaws`, and `Furniture Assembly`.\n", - "\n", - "**Note:** Different contributions can have the same domain" - ] - }, - { - "cell_type": "markdown", - "id": "0b02a66e-125e-47e6-9b6b-5f49d50990ca", - "metadata": {}, - "source": [ - "## Getting Started\n", - "\n", - "The first step in this notebook is to establish a workspace. Workspaces allow multi-tenancy or multiple different runs of this notebook. Without workspaces the results of each of the steps would be overwritten each time this notebook is executed.\n", - "\n", - "Users should change the `WORKSPACE_NAME` to suite their needs.\n", - "\n", - "> **NOTE:**\n", - "> If this notebook is ever run from the middle the following two cells need to be rerun to initialize variables used in every section." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0acd026f-65bd-4393-bb40-f8aa8bd6828b", - "metadata": {}, - "outputs": [], - "source": [ - "from pathlib import Path\n", - "\n", - "WORKSPACE_NAME = \"default\"\n", - "\n", - "WORKSPACE_ROOT = Path(\"workspaces\")\n", - "WORKSPACE_ROOT.mkdir(exist_ok=True)\n", - "\n", - "WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME\n", - "WORKSPACE_DIR.mkdir(exist_ok=True)\n", - "\n", - "SOURCE_DOCUMENT_DIR = \"source_documents\"\n", - "CONVERSION_DIR = \"conversion\"\n", - "CHUNKING_DIR = \"chunking\"\n", - "AUTHORING_DIR = \"authoring\"" - ] - }, - { - "cell_type": "markdown", - "id": "412d5a43-4ec4-43e5-8f08-21aae6c69bfd", - "metadata": {}, - "source": [ - "To create contributions, define the `name` for the contribution, and the `domain` and `summary`. The `name`, `domain` and `summary` go into a dictionary called `knowledge_contribution` which gets added to a list called `contributions`.\n", - "\n", - "Once the list of `contributions` is set, a directory with each contribution name is created within the workspace and subdirectories for `source_documents`, `conversion`, `chunking`, `authoring` are created within the contribution name directory." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8b440e34-817c-4588-9a8c-790f74ec5dbb", - "metadata": {}, - "outputs": [], - "source": [ - "# Populated later on\n", - "contributions = []\n", - "\n", - "# Inference Time Scaling Contribution\n", - "contribution_name = \"inference-time-scaling\"\n", - "contribution_domain = \"Artificial Intelligence Research\" \n", - "contribution_summary = \"A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)\"\n", - "\n", - "# Add contribution information to the knowledge_contribution dictionary for it\n", - "knowledge_contribution = {\"name\": contribution_name, \"domain\": contribution_domain, \"summary\": contribution_summary}\n", - "contributions.append(knowledge_contribution)\n", - "\n", - "# NFL Rules Contribution\n", - "contribution2_name = \"nfl\"\n", - "contribution2_domain = \"sports rules\" \n", - "contribution2_summary = \"Official playing rules of the National Football League 2022, 2023\"\n", - "knowledge_contribution2 = {\"name\": contribution2_name, \"domain\": contribution2_domain, \"summary\": contribution2_summary}\n", - "contributions.append(knowledge_contribution2)\n", - "\n", - "for contribution in contributions:\n", - " contribution_dir = WORKSPACE_DIR / contribution[\"name\"]\n", - " contribution[\"dir\"] = contribution_dir\n", - "\n", - " for subdir in [SOURCE_DOCUMENT_DIR, CONVERSION_DIR, CHUNKING_DIR, AUTHORING_DIR]:\n", - " (contribution_dir / subdir).mkdir(parents=True, exist_ok=True)" - ] - }, - { - "cell_type": "markdown", - "id": "344b7ac5-fc2a-40a8-8e1f-e8dd8b1153e7", - "metadata": {}, - "source": [ - "## Data Gathering\n", - "\n", - "Copy each contribution file to the `WORKSPACE_DIR//source_documents` directory for the following conversion step to detect them." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "26501e2f-7215-441f-9efa-075f87024893", - "metadata": {}, - "outputs": [], - "source": [ - "import shutil\n", - "\n", - "# Inference Time Scaling Contribution\n", - "orig_path = Path(\"sample-pdfs/inference-time-scaling.pdf\")\n", - "dst_path = WORKSPACE_DIR / contribution_name / SOURCE_DOCUMENT_DIR\n", - "\n", - "shutil.copy(orig_path, dst_path)\n", - "\n", - "# NFL Rules Contribution\n", - "rules_2022 = Path(\"sample-pdfs/2022-nfl-rulebook.pdf\")\n", - "rules_2023 = Path(\"sample-pdfs/2023-nfl-rulebook.pdf\")\n", - "rules_dst = WORKSPACE_DIR / contribution2_name / SOURCE_DOCUMENT_DIR\n", - "\n", - "shutil.copy(rules_2022, rules_dst)\n", - "shutil.copy(rules_2023, rules_dst) " - ] - }, - { - "cell_type": "markdown", - "id": "68478061", - "metadata": {}, - "source": [ - "Review this list of files to verify that all expected files are included in each of the contributions." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5325fa71-d09f-457f-9e55-be106dcf78e0", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Files to pre-process\\n--------------------\")\n", - "for contribution in contributions:\n", - " print(f\"\\nContribution: {contribution.get(\"name\")}\")\n", - " print(\"Files:\")\n", - " files = list((contribution['dir'] / SOURCE_DOCUMENT_DIR).glob(\"*.pdf\"))\n", - " for file in files:\n", - " print(file.resolve())" - ] - }, - { - "cell_type": "markdown", - "id": "8a4904e6-8e12-4473-8301-cba90e61bd8b", - "metadata": {}, - "source": [ - "## Document Conversion\n", - "\n", - "This notebook uses [Docling](https://github.com/docling-project/docling) to convert any type of document into a Docling Document: a structured representation of the original document that can be exported as JSON. The resulting JSON output is used in the following step, which performs Docling's chunking methods." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b91d4b2e-19cd-46e7-a912-ba9b2904c7cd", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install -qq docling" - ] - }, - { - "cell_type": "markdown", - "id": "749fb64b-d089-4844-9330-7f3639819e7a", - "metadata": {}, - "source": [ - "### Configure Docling conversion pipeline\n", - "\n", - "Next we set the configuration options for our conversion pipeline. The PDF Conversion options set here are the defaults. More information about pipeline configuration can be found on Docling.\n", - "\n", - "For a complete reference on Docling conversion pipeline configuration, see [PDFPipelineOptions](https://docling-project.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions) and [PDFFormatOptions](https://docling-project.github.io/docling/reference/document_converter/#docling.document_converter.InputFormat.XML_JATS)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "157c5e02-edd1-44f6-b20f-f6b4bda1aae7", - "metadata": {}, - "outputs": [], - "source": [ - "from docling.document_converter import DocumentConverter, PdfFormatOption\n", - "from docling.datamodel.base_models import InputFormat\n", - "from docling.datamodel.pipeline_options import PdfPipelineOptions\n", - "\n", - "pipeline_options = PdfPipelineOptions() # TODO: show the options that can be set\n", - "\n", - "doc_converter = DocumentConverter(\n", - " format_options={\n", - " InputFormat.PDF: PdfFormatOption(\n", - " pipeline_options=pipeline_options\n", - " )\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "73400c74-dead-4998-aee2-ddb00ddaa276", - "metadata": {}, - "source": [ - "Finally, we convert every document into Docling JSON as long as it is a valid file type to be converted" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a200039c-b8b2-4087-88ba-7bfb0e393cc9", - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "\n", - "json_files=[]\n", - "for contribution in contributions:\n", - " files = list((contribution[\"dir\"] / SOURCE_DOCUMENT_DIR).glob(\"*.pdf\"))\n", - " \n", - " for file in files:\n", - " doc = doc_converter.convert(source=file).document\n", - " doc_dict = doc.export_to_dict()\n", - " \n", - " conversion_output_dir = contribution[\"dir\"] / CONVERSION_DIR\n", - " conversion_output_dir.mkdir(parents=True, exist_ok=True)\n", - " \n", - " json_output_path = conversion_output_dir / f\"{file.stem}.json\"\n", - " with open(json_output_path, \"w\") as f:\n", - " json.dump(doc_dict, f)\n", - " print(f\"Path of JSON output is: {Path(json_output_path).resolve()}\")\n", - " json_files.append(json_output_path.resolve())" - ] - }, - { - "cell_type": "markdown", - "id": "40710019-7ec9-414e-ad72-1ba672cf5fc2", - "metadata": {}, - "source": [ - "### Post-Conversion: Illuminator Analysis" - ] - }, - { - "cell_type": "markdown", - "id": "2572e2d0-94dc-4ca0-b032-3978af26c9c9", - "metadata": {}, - "source": [ - "The output of document conversion is not always perfect. Data may become distorted or corrupted, which can negatively affect a model's performance after training. While optional, reviewing your converted data is strongly recommended. The following example explains how to use the Illuminator tool to identify common conversion issues." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "09e07e35-befb-4ed5-9fe4-41544f88d943", - "metadata": {}, - "outputs": [], - "source": [ - "from utils.illuminator.analysis import analyze_docling_tables\n", - "from utils.illuminator.utils import generate_summary\n", - "from docling.datamodel.document import DoclingDocument\n", - "\n", - "import json\n", - "import sys\n", - "from pathlib import Path\n", - "\n", - "for contribution in contributions:\n", - " conversion_dir = contribution[\"dir\"] / CONVERSION_DIR\n", - " converted_json_paths = list(conversion_dir.glob(\"*.json\"))\n", - " results = {}\n", - " \n", - " for path in converted_json_paths:\n", - " with open(path, \"r\") as f:\n", - " doc_dict = json.load(f)\n", - " \n", - " doc = DoclingDocument(**doc_dict)\n", - " results[path] = analyze_docling_tables(doc)\n", - " \n", - " summary_path = contribution[\"dir\"] / CONVERSION_DIR / f\"illuminator-readable-summary-{doc.name}.txt\"\n", - " \n", - " with open(summary_path, \"w\") as f:\n", - " generate_summary(results, file=f)\n", - " \n", - " print(f\"✅ Post-conversion summary saved to: {summary_path.resolve()}\")" - ] - }, - { - "cell_type": "markdown", - "id": "eea0876e-ac55-45fc-93e8-3e646a6c3104", - "metadata": {}, - "source": [ - "\n", - "The output of this post-conversion step should help determine whether to avoid using the content for chunking entirely or to manually edit it before proceeding with chunking.\n" - ] - }, - { - "cell_type": "markdown", - "id": "cafad55e-a4c0-4d6e-9da0-49519fa9bf74", - "metadata": {}, - "source": [ - "## Chunking\n", - "\n", - "The goal of chunking the converted documents is to provide the teacher model small and logical pieces of the source document to generate data off of.\n", - "\n", - "In this notebook we are doing chunking with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).\n", - "\n", - "The input to this notebook is a docling JSON file created after a docling conversion, or a directory of docling JSON files." - ] - }, - { - "cell_type": "markdown", - "id": "2482060c-a49f-4345-aa47-d54301939387", - "metadata": {}, - "source": [ - "### Initialize the Chunker\n", - "\n", - "Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.\n", - "The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document\n", - "\n", - "The `HybridChunker` builds on the `HierarchicalChunker` and by making it tokenization aware.\n", - "\n", - "The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks. Uncomment the commented out code to configure these." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "50df9d91-add4-46a1-a69d-0f7f9f69542e", - "metadata": {}, - "outputs": [], - "source": [ - "#from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n", - "#from transformers import AutoTokenizer\n", - "\n", - "from docling.chunking import HybridChunker\n", - "\n", - "#EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n", - "#MAX_TOKENS = 1024\n", - "#\n", - "# tokenizer = HuggingFaceTokenizer(\n", - "# tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n", - "# max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case\n", - "# merge_peers=True # \n", - "# )\n", - "\n", - "chunker = HybridChunker(\n", - " #tokenizer=tokenizer,\n", - " #merge_peers=True, # whether to merge undersized chunks - defaults to True\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "54ce1d6f-b8d3-470c-b3c9-675911f0ee92", - "metadata": {}, - "source": [ - "### Load and chunk the converted docling document\n", - "\n", - "Next lets convert the document we want to chunk up into a Docling Document.\n", - "\n", - "The resulting chunks are stored in a file called chunks.jsonl in the `chunks` directory in your contribution. This file is used as an input in a later step when creating the seed dataset for SDG." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "db983c05-4aa6-4261-9283-2adab69bfbd3", - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "from docling.document_converter import DocumentConverter\n", - "\n", - "all_chunks = []\n", - "\n", - "for contribution in contributions:\n", - " conversion_dir = contribution[\"dir\"] / CONVERSION_DIR\n", - " json_files = list(conversion_dir.glob(\"*.json\"))\n", - " chunking_output_dir = contribution[\"dir\"] / CHUNKING_DIR\n", - " chunking_output_dir.mkdir(parents=True, exist_ok=True)\n", - " contribution_chunks = []\n", - " \n", - " for file in json_files:\n", - " # reconvert the docling JSON for chunking\n", - " doc = DocumentConverter().convert(source=file)\n", - " \n", - " chunk_iter = chunker.chunk(dl_doc=doc.document)\n", - " chunk_objs = list(chunk_iter)\n", - " \n", - " print(f\"Extracted {len(chunk_objs)} chunks from {doc.document.name}\")\n", - " \n", - " for chunk in chunk_objs:\n", - " c = dict(chunk=chunker.contextualize(chunk=chunk), file=doc.document.name,metadata=chunk.meta.export_json_dict())\n", - " contribution_chunks.append(c)\n", - " all_chunks.append(c)\n", - "\n", - "\n", - " chunks_file_path = chunking_output_dir / \"chunks.jsonl\"\n", - " with open(chunks_file_path, \"w\", encoding=\"utf-8\") as file:\n", - " for chunk in contribution_chunks:\n", - " json.dump(chunk, file)\n", - " file.write(\"\\n\")\n", - " print(f\"Path of chunks JSON is: {Path(chunks_file_path).resolve()}\")" - ] - }, - { - "cell_type": "markdown", - "id": "0fb38545-eb84-4923-8fc4-d10ed08eab26", - "metadata": {}, - "source": [ - "### View the Chunks" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4fdf34c7-9829-43d2-bf9f-7d1d55bb6a4c", - "metadata": {}, - "outputs": [], - "source": [ - "chunk_gen = iter(all_chunks)" - ] - }, - { - "cell_type": "markdown", - "id": "811992ac", - "metadata": {}, - "source": [ - "To view the chunks one by one, rerun the following cell. The document is now broken into small sections with metadata about the chunk based on the document's format." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ee9a8531", - "metadata": {}, - "outputs": [], - "source": [ - "print(next(chunk_gen)['chunk'])" - ] - }, - { - "cell_type": "markdown", - "id": "a510f8c7-8cd3-4867-8742-9f4f9cda9e9f", - "metadata": {}, - "source": [ - "## Authoring\n", - "\n", - "To start the synthetic data generation process, users need to prepare a diverse set of questions and answers based off chunks from each source document. A chunk and question-and-answer pairs are called a seed example." - ] - }, - { - "cell_type": "markdown", - "id": "f3490c8a-5ee8-44cd-ae5e-26a6ca7b4017", - "metadata": {}, - "source": [ - "### Install docling-sdg\n", - "\n", - "[Docling-sdg](https://github.com/docling-project/docling-sdg) project is used to generate question and answer pairs for seed examples." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "86c48e52-cda7-48ac-84dc-0b844aed5f98", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install -qq docling-sdg" - ] - }, - { - "cell_type": "markdown", - "id": "d65ec755-e3de-40ab-bf3a-23ebb29a705d", - "metadata": {}, - "source": [ - "### Initialize QA generator model & Number of Seed examples\n", - "\n", - "To generate seed examples you need to set: \n", - "1. The the Open AI compatible endpoint for the model generating question and answer pairs\n", - "2. The model's API key\n", - "3. The model's name\n", - "4. The number of seed example you wish to generate for each contribution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "874d4de8", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "API_KEY = os.getenv(\"MODEL_API_KEY\") or \"\" # the API access key for your account (cannot be empty)\n", - "ENDPOINT_URL = os.getenv(\"MODEL_ENDPOINT_URL\") or \"\" # the URL of your model's API. URL can end in \"/v1\"\n", - "MODEL_NAME = os.getenv(\"MODEL_NAME\") or \"mistralai/Mixtral-8x7B-Instruct-v0.1\" # the name of your model\n", - "NUM_SEED_EXAMPLES = 7" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b702267e-f550-4bc2-bce4-c0fcecbbd292", - "metadata": {}, - "outputs": [], - "source": [ - "from utils.qna_gen import generate_seed_examples\n", - "\n", - "for contribution in contributions:\n", - " chunks_jsonl_path = contribution[\"dir\"] / CHUNKING_DIR / \"chunks.jsonl\"\n", - " authoring_path = contribution[\"dir\"] / AUTHORING_DIR\n", - "\n", - " qna_output_path = generate_seed_examples(contribution[\"name\"],\n", - " chunks_jsonl_path,\n", - " authoring_path,\n", - " contribution[\"domain\"],\n", - " contribution[\"summary\"],\n", - " NUM_SEED_EXAMPLES,\n", - " API_KEY,\n", - " ENDPOINT_URL,\n", - " MODEL_NAME)\n", - " print(f\"qna.yaml saved to: {qna_output_path}\")\n" - ] - }, - { - "cell_type": "markdown", - "id": "6c574f96-5860-48b9-b4ac-01d367c7717b", - "metadata": {}, - "source": [ - "### Review and Revise Seed Examples\n", - "\n", - "A quality set of seed examples has diverse contexts and question-and-answer pairs across every seed example. You can asses the `qna.yaml` files in your preferred text editor to ensure the quality, diversity, and style of generated questions and answers, and modify them accordingly.\n", - "\n", - "After assessment, the `qna.yaml` files can be quickly reviewed to ensure they includes the required elements and correct number of each. It is recommended to have at least 5 seed examples. Each seed example must have 3 question and answer pairs." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "da3ef131-e5a3-4854-b6e9-3277273a91dd", - "metadata": {}, - "outputs": [], - "source": [ - "from utils.qna_gen import review_seed_examples_file\n", - "\n", - "\n", - "\n", - "for contribution in contributions:\n", - " qna_path = contribution[\"dir\"] / AUTHORING_DIR / \"qna.yaml\"\n", - " review_seed_examples_file(qna_path, min_seed_examples=5, num_qa_pairs=3)" - ] - }, - { - "cell_type": "markdown", - "id": "1f101076-a50f-49ea-a83b-46eaa8b39cc4", - "metadata": {}, - "source": [ - "## Create Seed Dataset for SDG\n", - "\n", - "This step creates the seed data for SDG. This data is a JSON filed that contains a combination of the `seed_examples` in the qna.yaml and the chunks from the source document. \n", - "\n", - "Intermediate seed data files are created for each contribution with the contribution's name included in the file name. For example in the `nfl` contribution, a file containing seed data called `seed_data-nfl.jsonl` would be created in `$WORKSPACE_DIR/nfl`. This file contains a combination of all of the chunks from the NFL source documents and the seed examples in the `qna.yaml` in `$WORKSPACE_DIR/nfl/authoring`.\n", - "\n", - "After seed data files are created for each contribution, a final `seed_data.jsonl` is created in `$WORKSPACE_DIR`. This file is a concatenation of all of the intermediate `seed_data-{contribution name}.jsonl` files and should be used as an input to SDG." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e2c6e31b-e8a9-406c-b2dc-27433c8fd8ec", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install -qq datasets transformers" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ab2c9ed2-8ba8-4959-8e01-81625b81d286", - "metadata": {}, - "outputs": [], - "source": [ - "from utils.create_seed_dataset import get_seed_dataset, safe_concatenate_datasets\n", - "\n", - "contribution_datasets = []\n", - "for contribution in contributions:\n", - " chunks_dir = contribution[\"dir\"] / CHUNKING_DIR\n", - " qna_dir = contribution[\"dir\"] / AUTHORING_DIR\n", - " seed_data = get_seed_dataset(chunks_dir, qna_dir)\n", - " output_path = f'{contribution_dir}/seed_data-{contribution_name}.jsonl'\n", - " seed_data.to_json(output_path, orient='records', lines=True)\n", - " contribution_datasets.append(seed_data)\n", - " print(f\"Intermediate results saved to: {output_path}\")\n", - "\n", - "final_seed_data = safe_concatenate_datasets(contribution_datasets)\n", - "output_path = f'{WORKSPACE_DIR}/seed_data.jsonl'\n", - "final_seed_data.to_json(output_path, orient='records', lines=True)\n", - "\n", - "print(f\"Final seed data contains {final_seed_data.data.num_rows} rows\")\n", - "print(f\"Final seed data for SDG saved to: {output_path}\")" - ] - }, - { - "cell_type": "markdown", - "id": "50ff36f4-19fc-4a27-b51a-3688e7b630e4", - "metadata": {}, - "source": [ - "### Inspect the seed data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a6936825-31c1-4b46-a1af-2fb46f50158d", - "metadata": {}, - "outputs": [], - "source": [ - "print(seed_data.data.table.slice(length=1))" - ] - }, - { - "cell_type": "markdown", - "id": "24a8fcdb-8035-4f30-b856-46afe9f928a1", - "metadata": {}, - "source": [ - "# Summary\n", - "\n", - "To recap, given source documents in PDF format, this notebook:\n", - "\n", - "1. Converts the documents using Docling and saves in the Docling Document format\n", - "2. Splits the extracted text into chunks of JSON\n", - "3. 
Generates Q&A pairs for a subset of those chunks\n", - "4. Creates a `qna.yaml` available for inspection and revision\n", - "5. Combines the chunks and `qna.yaml` to create a `seed_data.jsonl` to use for SDG\n", - "\n", - "The next step is to use the resulting `seed_data.jsonl` for SDG, such as illustrated in [this notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/instructlab/knowledge/knowledge_generation_and_mixing.ipynb)." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "af99f876-0ffd-4079-aeb7-4cead05daaf4", + "metadata": {}, + "source": [ + "# 🐶 Data Pre-Processing: From source PDF to SDG-ready\n", + "\n", + "This notebook outlines the data pre-processing stages for knowledge contributions. A knowledge contribution consists of one or more PDF files that serve as the dataset for fine-tuning a model.\n", + "\n", + "At a high level the steps for the data pre-processing are:\n", + "\n", + "1. [Contribution Overview](#Contribution-Overview)\n", + "1. [Getting Started](#Getting-Started)\n", + "1. [Data Gathering](#Data-Gathering)\n", + "1. [Document Conversion](#Document-Conversion)\n", + "1. [Chunking](#Chunking)\n", + "1. [Authoring](#Authoring)\n", + "1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)\n", + "\n", + "Each step occurs in order and produces outputs used in subsequent steps. The final step creates an SDG dataset that allows users to run the [SDG-Hub knowledge-generation notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/knowledge_tuning/instructlab/knowledge_generation_and_mixing.ipynb) and generate samples.\n", + "\n", + "**NOTE**: Starting the notebook using Python 3.12 is recommended.\n", + "\n", + "\n", + "***" + ] + }, + { + "cell_type": "markdown", + "id": "03227e64-b5d7-4394-af30-530fc5baed2d", + "metadata": {}, + "source": [ + "## Contribution Overview" + ] + }, + { + "cell_type": "markdown", + "id": "1a008179-e734-4476-bfc2-a1e673efde79", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "### What is a Contribution?\n", + "\n", + "To add knowledge to a model, a user groups source documents of that contain the knowledge into knowledge contributions. A knowledge contribution is made up of:\n", + "\n", + "1. One or more PDF documents that can be described by a contribution summary.\n", + "2. A contribution summary.\n", + "3. A contribution domain.\n", + "4. 
A unique name used to create a directory in the workspace for artifacts created by each step for the contribution.\n", + "\n", + "Once contributions are set up, a user can go through the data pre-processing workflow.\n", + "\n", + "### What is a Contribution Summary?\n", + "\n", + "In the synthetic data generation step, a model (known as the teacher model) generates synthetic data based on the source document.\n", + "The contribution summary and domain are used in the prompts that are sent to the teacher model to create data.\n", + "\n", + "The document gets broken up into [chunks](#Chunking), and each chunk is included in the prompt sent to the teacher model.\n", + "The contribution summary provides additional context to each chunk of a source document, ensuring the teacher model has the necessary background information.\n", + "\n", + "Contribution summaries should be specific, avoid acronyms or other vague references, and represent the document's focus areas.\n", + "When a contribution includes many versions of the same document, the contribution summary should include publication dates, volume numbers, or any other identifiers that distinguish between versions.\n", + "\n", + "Here is an example of a contribution summary from a recent paper on [inference-time scaling](https://arxiv.org/pdf/2502.01618):\n", + "\n", + "```\n", + "\"A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)\"\n", + "```\n", + "\n", + "Since the title of the paper does a good job summarizing the paper, the summary is based on the title but with the acronym LLM spelled out.\n", + "\n", + "Usually, a contribution has only one document. Contributions with multiple documents make sense when the subject matter and format are similar across a group of documents.\n", + "\n", + "An example of a contribution with multiple documents would be teaching a model an organization's bylaws for the years 2021, 2022, 2023, and 2024, with a different PDF for each year.\n", + "\n", + "A contribution summary in this case might look like:\n", + "\n", + "`Bylaws of organization Foo from 2021 - 2024`\n", + "\n", + "If there were only one source document, from the year 2023, the contribution summary would be:\n", + "\n", + "`2023 Bylaws of organization Foo`\n", + "\n", + "Another example of having multiple documents within the same contribution would be if the documents had the same format. An example here could be grouping together a furniture company's instruction manuals. The format and layout of the instruction manuals would be the same across the manuals, but each manual covers a different piece of furniture.\n", + "\n", + "`Furniture company Foo's assembly instructions for tables, desks, and nightstands`\n", + "\n", + "If the contribution contained only a PDF of the assembly instructions for an oak dining table, the summary would be:\n", + "\n", + "`Assembly instructions for furniture company Foo's oak dining table`\n", + "\n", + "### What is a Contribution Domain?\n", + "\n", + "A contribution's domain is the overarching subject or scope of the source document(s). The domain provides critical context to guide the teacher model in generating synthetic data that is relevant and grounded.\n", + "\n", + "The domain should be brief: no more than three words, and ideally one or two.\n", + "\n", + "To determine the domain, users should review the document's primary subject and identify the main topic or purpose of the document.\n", + "Consider the intended use of the document and align it with the use case or audience. For example, a tech manual for developers might fall under the “software development” domain.\n", + "\n", + "For the contribution summary examples discussed in the previous sections, domains could be `Artificial Intelligence Research`, `Bylaws`, and `Furniture Assembly`.\n", + "\n", + "**Note:** Different contributions can have the same domain." + ] + }, + { + "cell_type": "markdown", + "id": "0b02a66e-125e-47e6-9b6b-5f49d50990ca", + "metadata": {}, + "source": [ + "## Getting Started\n", + "\n", + "The first step in this notebook is to establish a workspace. Workspaces allow multi-tenancy or multiple different runs of this notebook. Without workspaces, the results of each of the steps would be overwritten each time this notebook is executed.\n", + "\n", + "Users should change the `WORKSPACE_NAME` to suit their needs.\n", + "\n", + "> **NOTE:**\n", + "> If this notebook is ever run from the middle, the following two cells need to be rerun to initialize variables used in every section." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0acd026f-65bd-4393-bb40-f8aa8bd6828b", + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "WORKSPACE_NAME = \"default\"\n", + "\n", + "WORKSPACE_ROOT = Path(\"workspaces\")\n", + "WORKSPACE_ROOT.mkdir(exist_ok=True)\n", + "\n", + "WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME\n", + "WORKSPACE_DIR.mkdir(exist_ok=True)\n", + "\n", + "SOURCE_DOCUMENT_DIR = \"source_documents\"\n", + "CONVERSION_DIR = \"conversion\"\n", + "CHUNKING_DIR = \"chunking\"\n", + "AUTHORING_DIR = \"authoring\"" + ] + }, + { + "cell_type": "markdown", + "id": "412d5a43-4ec4-43e5-8f08-21aae6c69bfd", + "metadata": {}, + "source": [ + "To create a contribution, define its `name`, `domain`, and `summary`. The `name`, `domain`, and `summary` go into a dictionary called `knowledge_contribution`, which gets added to a list called `contributions`.\n", + "\n", + "Once the list of `contributions` is set, a directory with each contribution name is created within the workspace, and subdirectories for `source_documents`, `conversion`, `chunking`, and `authoring` are created within the contribution name directory."
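+ , + "\n", + "As a quick sanity check on the contribution metadata (an empty summary, a domain longer than three words, a duplicate name), you can run a small validation once the next cell has populated `contributions`. The `validate_contributions` helper below is a hypothetical sketch and is not part of the notebook utilities.\n", + "\n", + "```python\n", + "def validate_contributions(contributions):\n", + "    names = set()\n", + "    for c in contributions:\n", + "        assert c['name'] and c['name'] not in names, f\"duplicate or empty name: {c['name']}\"\n", + "        names.add(c['name'])\n", + "        assert c['summary'].strip(), f\"{c['name']}: summary must not be empty\"\n", + "        assert len(c['domain'].split()) <= 3, f\"{c['name']}: keep the domain to 3 words or fewer\"\n", + "\n", + "# validate_contributions(contributions)\n", + "```"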
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b440e34-817c-4588-9a8c-790f74ec5dbb", + "metadata": {}, + "outputs": [], + "source": [ + "# Populated later on\n", + "contributions = []\n", + "\n", + "# Inference Time Scaling Contribution\n", + "contribution_name = \"inference-time-scaling\"\n", + "contribution_domain = \"Artificial Intelligence Research\" \n", + "contribution_summary = \"A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)\"\n", + "\n", + "# Add contribution information to the knowledge_contribution dictionary for it\n", + "knowledge_contribution = {\"name\": contribution_name, \"domain\": contribution_domain, \"summary\": contribution_summary}\n", + "contributions.append(knowledge_contribution)\n", + "\n", + "# NFL Rules Contribution\n", + "contribution2_name = \"nfl\"\n", + "contribution2_domain = \"sports rules\" \n", + "contribution2_summary = \"Official playing rules of the National Football League 2022, 2023\"\n", + "knowledge_contribution2 = {\"name\": contribution2_name, \"domain\": contribution2_domain, \"summary\": contribution2_summary}\n", + "contributions.append(knowledge_contribution2)\n", + "\n", + "for contribution in contributions:\n", + " contribution_dir = WORKSPACE_DIR / contribution[\"name\"]\n", + " contribution[\"dir\"] = contribution_dir\n", + "\n", + " for subdir in [SOURCE_DOCUMENT_DIR, CONVERSION_DIR, CHUNKING_DIR, AUTHORING_DIR]:\n", + " (contribution_dir / subdir).mkdir(parents=True, exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "id": "344b7ac5-fc2a-40a8-8e1f-e8dd8b1153e7", + "metadata": {}, + "source": [ + "## Data Gathering\n", + "\n", + "Copy each contribution file to the `WORKSPACE_DIR//source_documents` directory for the following conversion step to detect them." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "26501e2f-7215-441f-9efa-075f87024893", + "metadata": {}, + "outputs": [], + "source": [ + "import shutil\n", + "\n", + "# Inference Time Scaling Contribution\n", + "orig_path = Path(\"sample-pdfs/inference-time-scaling.pdf\")\n", + "dst_path = WORKSPACE_DIR / contribution_name / SOURCE_DOCUMENT_DIR\n", + "\n", + "shutil.copy(orig_path, dst_path)\n", + "\n", + "# NFL Rules Contribution\n", + "rules_2022 = Path(\"sample-pdfs/2022-nfl-rulebook.pdf\")\n", + "rules_2023 = Path(\"sample-pdfs/2023-nfl-rulebook.pdf\")\n", + "rules_dst = WORKSPACE_DIR / contribution2_name / SOURCE_DOCUMENT_DIR\n", + "\n", + "shutil.copy(rules_2022, rules_dst)\n", + "shutil.copy(rules_2023, rules_dst) " + ] + }, + { + "cell_type": "markdown", + "id": "68478061", + "metadata": {}, + "source": [ + "Review this list of files to verify that all expected files are included in each of the contributions." 
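+ , + "\n", + "The next cell prints the files it finds. If you also want a check that fails loudly when a contribution has no PDFs, a minimal sketch:\n", + "\n", + "```python\n", + "for contribution in contributions:\n", + "    pdfs = list((contribution['dir'] / SOURCE_DOCUMENT_DIR).glob('*.pdf'))\n", + "    assert pdfs, f\"no PDFs found for {contribution['name']}\"\n", + "```"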
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5325fa71-d09f-457f-9e55-be106dcf78e0", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Files to pre-process\\n--------------------\")\n", + "for contribution in contributions:\n", + " print(f\"\\nContribution: {contribution.get('name')}\")\n", + " print(\"Files:\")\n", + " files = list((contribution['dir'] / SOURCE_DOCUMENT_DIR).glob(\"*.pdf\"))\n", + " for file in files:\n", + " print(file.resolve())" + ] + }, + { + "cell_type": "markdown", + "id": "8a4904e6-8e12-4473-8301-cba90e61bd8b", + "metadata": {}, + "source": [ + "## Document Conversion\n", + "\n", + "This notebook uses [Docling](https://github.com/docling-project/docling) to convert any type of document into a Docling Document: a structured representation of the original document that can be exported as JSON. The resulting JSON output is used in the following step, which performs Docling's chunking methods." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b91d4b2e-19cd-46e7-a912-ba9b2904c7cd", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -qq docling" + ] + }, + { + "cell_type": "markdown", + "id": "749fb64b-d089-4844-9330-7f3639819e7a", + "metadata": {}, + "source": [ + "### Configure Docling conversion pipeline\n", + "\n", + "Next, we set the configuration options for our conversion pipeline. The PDF conversion options set here are the defaults. More information about pipeline configuration can be found in the Docling documentation.\n", + "\n", + "For a complete reference on Docling conversion pipeline configuration, see [PdfPipelineOptions](https://docling-project.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions) and [PdfFormatOption](https://docling-project.github.io/docling/reference/document_converter/)."
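+ , + "\n", + "The defaults are used as-is in the next cell. If you need to tune the conversion, a few commonly adjusted `PdfPipelineOptions` fields are sketched below; this is illustrative, based on the Docling reference linked above, so check your installed version for the exact option names:\n", + "\n", + "```python\n", + "from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode\n", + "\n", + "pipeline_options = PdfPipelineOptions()\n", + "pipeline_options.do_ocr = True              # run OCR for scanned pages or bitmap-only text\n", + "pipeline_options.do_table_structure = True  # recover table structure during conversion\n", + "pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # slower, higher-quality tables\n", + "```"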
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "157c5e02-edd1-44f6-b20f-f6b4bda1aae7", + "metadata": {}, + "outputs": [], + "source": [ + "from docling.document_converter import DocumentConverter, PdfFormatOption\n", + "from docling.datamodel.base_models import InputFormat\n", + "from docling.datamodel.pipeline_options import PdfPipelineOptions\n", + "\n", + "pipeline_options = PdfPipelineOptions() # TODO: show the options that can be set\n", + "\n", + "doc_converter = DocumentConverter(\n", + " format_options={\n", + " InputFormat.PDF: PdfFormatOption(\n", + " pipeline_options=pipeline_options\n", + " )\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "73400c74-dead-4998-aee2-ddb00ddaa276", + "metadata": {}, + "source": [ + "Finally, we convert every document into Docling JSON as long as it is a valid file type to be converted" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a200039c-b8b2-4087-88ba-7bfb0e393cc9", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "json_files=[]\n", + "for contribution in contributions:\n", + " files = list((contribution[\"dir\"] / SOURCE_DOCUMENT_DIR).glob(\"*.pdf\"))\n", + " \n", + " for file in files:\n", + " doc = doc_converter.convert(source=file).document\n", + " doc_dict = doc.export_to_dict()\n", + " \n", + " conversion_output_dir = contribution[\"dir\"] / CONVERSION_DIR\n", + " conversion_output_dir.mkdir(parents=True, exist_ok=True)\n", + " \n", + " json_output_path = conversion_output_dir / f\"{file.stem}.json\"\n", + " with open(json_output_path, \"w\") as f:\n", + " json.dump(doc_dict, f)\n", + " print(f\"Path of JSON output is: {Path(json_output_path).resolve()}\")\n", + " json_files.append(json_output_path.resolve())" + ] + }, + { + "cell_type": "markdown", + "id": "40710019-7ec9-414e-ad72-1ba672cf5fc2", + "metadata": {}, + "source": [ + "### Post-Conversion: Illuminator Analysis" + ] + }, + { + "cell_type": "markdown", + "id": "2572e2d0-94dc-4ca0-b032-3978af26c9c9", + "metadata": {}, + "source": [ + "The output of document conversion is not always perfect. Data may become distorted or corrupted, which can negatively affect a model's performance after training. While optional, reviewing your converted data is strongly recommended. The following example explains how to use the Illuminator tool to identify common conversion issues." 
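+ , + "\n", + "In addition to the Illuminator analysis below, which focuses on table extraction issues, you can spot-check a conversion manually by loading the saved JSON back into a `DoclingDocument` and skimming its Markdown export. A minimal sketch, reusing the `json_files` list built during conversion:\n", + "\n", + "```python\n", + "import json\n", + "from docling.datamodel.document import DoclingDocument\n", + "\n", + "with open(json_files[0]) as f:\n", + "    doc = DoclingDocument(**json.load(f))\n", + "\n", + "print(doc.export_to_markdown()[:2000])  # skim the beginning of the converted text\n", + "```"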
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "09e07e35-befb-4ed5-9fe4-41544f88d943", + "metadata": {}, + "outputs": [], + "source": [ + "from utils.illuminator.analysis import analyze_docling_tables\n", + "from utils.illuminator.utils import generate_summary\n", + "from docling.datamodel.document import DoclingDocument\n", + "\n", + "import json\n", + "from pathlib import Path\n", + "\n", + "for contribution in contributions:\n", + " conversion_dir = contribution[\"dir\"] / CONVERSION_DIR\n", + " converted_json_paths = list(conversion_dir.glob(\"*.json\"))\n", + " results = {}\n", + " \n", + " for path in converted_json_paths:\n", + " with open(path, \"r\") as f:\n", + " doc_dict = json.load(f)\n", + " \n", + " doc = DoclingDocument(**doc_dict)\n", + " results[path] = analyze_docling_tables(doc)\n", + " \n", + " summary_path = contribution[\"dir\"] / CONVERSION_DIR / f\"illuminator-readable-summary-{doc.name}.txt\"\n", + " \n", + " with open(summary_path, \"w\") as f:\n", + " generate_summary(results, file=f)\n", + " \n", + " print(f\"✅ Post-conversion summary saved to: {summary_path.resolve()}\")" + ] + }, + { + "cell_type": "markdown", + "id": "eea0876e-ac55-45fc-93e8-3e646a6c3104", + "metadata": {}, + "source": [ + "\n", + "The output of this post-conversion step should help determine whether to avoid using the content for chunking entirely or to manually edit it before proceeding with chunking.\n" + ] + }, + { + "cell_type": "markdown", + "id": "cafad55e-a4c0-4d6e-9da0-49519fa9bf74", + "metadata": {}, + "source": [ + "## Chunking\n", + "\n", + "The goal of chunking the converted documents is to provide the teacher model with small, logical pieces of the source document to generate data from.\n", + "\n", + "In this notebook we are doing chunking with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).\n", + "\n", + "The input to this step is a Docling JSON file created by the conversion above, or a directory of Docling JSON files." + ] + }, + { + "cell_type": "markdown", + "id": "2482060c-a49f-4345-aa47-d54301939387", + "metadata": {}, + "source": [ + "### Initialize the Chunker\n", + "\n", + "Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.\n", + "The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document.\n", + "\n", + "The `HybridChunker` builds on the `HierarchicalChunker` by making it tokenization aware.\n", + "\n", + "The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks. Uncomment the commented-out code to configure these."
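+ , + "\n", + "If you want purely structure-based chunks with no token budget, the `HierarchicalChunker` can be used instead of the `HybridChunker`. A minimal sketch, assuming it is exposed from `docling.chunking` in your installed version:\n", + "\n", + "```python\n", + "from docling.chunking import HierarchicalChunker\n", + "\n", + "chunker = HierarchicalChunker()  # chunks follow the document hierarchy only\n", + "```"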
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "50df9d91-add4-46a1-a69d-0f7f9f69542e", + "metadata": {}, + "outputs": [], + "source": [ + "#from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n", + "#from transformers import AutoTokenizer\n", + "\n", + "from docling.chunking import HybridChunker\n", + "\n", + "#EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n", + "#MAX_TOKENS = 1024\n", + "#\n", + "# tokenizer = HuggingFaceTokenizer(\n", + "# tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n", + "# max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case\n", + "# merge_peers=True # \n", + "# )\n", + "\n", + "chunker = HybridChunker(\n", + " #tokenizer=tokenizer,\n", + " #merge_peers=True, # whether to merge undersized chunks - defaults to True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "54ce1d6f-b8d3-470c-b3c9-675911f0ee92", + "metadata": {}, + "source": [ + "### Load and chunk the converted docling document\n", + "\n", + "Next, let's convert each document we want to chunk into a Docling Document.\n", + "\n", + "The resulting chunks are stored in a file called `chunks.jsonl` in the `chunking` directory of your contribution. This file is used as an input in a later step when creating the seed dataset for SDG." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db983c05-4aa6-4261-9283-2adab69bfbd3", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from docling.document_converter import DocumentConverter\n", + "\n", + "all_chunks = []\n", + "\n", + "for contribution in contributions:\n", + " conversion_dir = contribution[\"dir\"] / CONVERSION_DIR\n", + " json_files = list(conversion_dir.glob(\"*.json\"))\n", + " chunking_output_dir = contribution[\"dir\"] / CHUNKING_DIR\n", + " chunking_output_dir.mkdir(parents=True, exist_ok=True)\n", + " contribution_chunks = []\n", + " \n", + " for file in json_files:\n", + " # reconvert the docling JSON for chunking\n", + " doc = DocumentConverter().convert(source=file)\n", + " \n", + " chunk_iter = chunker.chunk(dl_doc=doc.document)\n", + " chunk_objs = list(chunk_iter)\n", + " \n", + " print(f\"Extracted {len(chunk_objs)} chunks from {doc.document.name}\")\n", + " \n", + " for chunk in chunk_objs:\n", + " c = dict(chunk=chunker.contextualize(chunk=chunk), file=doc.document.name, metadata=chunk.meta.export_json_dict())\n", + " contribution_chunks.append(c)\n", + " all_chunks.append(c)\n", + "\n", + "\n", + " chunks_file_path = chunking_output_dir / \"chunks.jsonl\"\n", + " with open(chunks_file_path, \"w\", encoding=\"utf-8\") as file:\n", + " for chunk in contribution_chunks:\n", + " json.dump(chunk, file)\n", + " file.write(\"\\n\")\n", + " print(f\"Path of chunks JSON is: {Path(chunks_file_path).resolve()}\")" + ] + }, + { + "cell_type": "markdown", + "id": "0fb38545-eb84-4923-8fc4-d10ed08eab26", + "metadata": {}, + "source": [ + "### View the Chunks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4fdf34c7-9829-43d2-bf9f-7d1d55bb6a4c", + "metadata": {}, + "outputs": [], + "source": [ + "chunk_gen = iter(all_chunks)" + ] + }, + { + "cell_type": "markdown", + "id": "811992ac", + "metadata": {}, + "source": [ + "To view the chunks one by one, rerun the following cell. The document is now broken into small sections with metadata about the chunk based on the document's format."
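+ , + "\n", + "Beyond reading individual chunks, it can help to check chunk sizes against the token budget. A small sketch, assuming the `transformers` library is available (it is installed later for the seed-dataset step) and using the embedding model suggested in the commented-out chunker configuration:\n", + "\n", + "```python\n", + "from transformers import AutoTokenizer\n", + "\n", + "tok = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')\n", + "lengths = [len(tok.encode(c['chunk'])) for c in all_chunks]\n", + "print(f\"{len(lengths)} chunks; min/avg/max tokens: {min(lengths)}/{sum(lengths) // len(lengths)}/{max(lengths)}\")\n", + "```"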
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee9a8531", + "metadata": {}, + "outputs": [], + "source": [ + "print(next(chunk_gen)['chunk'])" + ] + }, + { + "cell_type": "markdown", + "id": "a510f8c7-8cd3-4867-8742-9f4f9cda9e9f", + "metadata": {}, + "source": [ + "## Authoring\n", + "\n", + "To start the synthetic data generation process, users need to prepare a diverse set of questions and answers based on chunks from each source document. A chunk and its question-and-answer pairs are called a seed example." + ] + }, + { + "cell_type": "markdown", + "id": "f3490c8a-5ee8-44cd-ae5e-26a6ca7b4017", + "metadata": {}, + "source": [ + "### Install docling-sdg\n", + "\n", + "The [Docling-sdg](https://github.com/docling-project/docling-sdg) project is used to generate question and answer pairs for seed examples." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "86c48e52-cda7-48ac-84dc-0b844aed5f98", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -qq docling-sdg" + ] + }, + { + "cell_type": "markdown", + "id": "d65ec755-e3de-40ab-bf3a-23ebb29a705d", + "metadata": {}, + "source": [ + "### Initialize QA generator model & Number of Seed examples\n", + "\n", + "To generate seed examples, you need to set:\n", + "1. The OpenAI-compatible endpoint for the model generating question and answer pairs\n", + "2. The model's API key\n", + "3. The model's name\n", + "4. The number of chunks to select for authoring for each contribution (configured below under *Configure subset selection*)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "874d4de8", + "metadata": { + "tags": [ + "parameters" + ] + }, + "outputs": [], + "source": [ + "API_KEY = \"none\" # the API access key for your account (cannot be empty)\n", + "API_URL = \"http://127.0.0.1:11434/v1\" # the URL of your model's API\n", + "MODEL_ID = \"granite3.3\" # the name of your model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b702267e-f550-4bc2-bce4-c0fcecbbd292", + "metadata": {}, + "outputs": [], + "source": [ + "from pydantic import SecretStr\n", + "\n", + "# NOTE: these import paths assume the current docling-sdg layout; adjust them if your installed version differs\n", + "from docling_sdg.qa.base import GenerateOptions, LlmProvider\n", + "\n", + "generate_options = GenerateOptions(project_id=\"project_id\")\n", + "generate_options.provider = LlmProvider.OPENAI_LIKE\n", + "generate_options.api_key = SecretStr(API_KEY)\n", + "generate_options.url = API_URL\n", + "generate_options.model_id = MODEL_ID" + ] + }, + { + "cell_type": "markdown", + "id": "32e13a94-1c5e-4310-9500-6940368ec2ea", + "metadata": {}, + "source": [ + "### [OPTIONAL] Prompt customization for Q&A Generation\n", + "\n", + "The cell below modifies the default prompt used by `docling-sdg` for Q&A generation by adding a customization statement.\n", + "\n", + "Insert your own customization statement below and run the rest of the cells in this section if you would like to stylistically customize Q&A generation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78b51f57-7b7b-4d53-a129-29c291939dae", + "metadata": {}, + "outputs": [], + "source": [ + "customization_str = \"Write at the fifth grade level.\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6e8dba7-8798-429b-9f46-806111ce6e6c", + "metadata": {}, + "outputs": [], + "source": [ + "from docling_sdg.qa.prompts.generation_prompts import QaPromptTemplate\n", + "\n", + "\n", + "CUSTOM_COMBINED_QUESTION_PROMPT = (\n", + " \"I will provide you a text passage. I need you to generate three questions that \"\n", + " \"must be answered only with information contained in this passage, and nothing \"\n", + " \"else.\\n\"\n", + " 'The first question is of type \"fact_single\", which means that the answer to this '\n", + " \"question is a simple, single piece of factual information contained in the \"\n", + " \"context.\\n\"\n", + " 'The second question is of type \"summary\", which means that the answer to this '\n", + " \"question summarizes different pieces of factual information contained in the \"\n", + " \"context.\\n\"\n", + " 'The third question is of type \"reasoning\", which is a question that requires the '\n", + " \"reader to think critically and make an inference or draw a conclusion based on \"\n", + " \"the information provided in the passage.\\n\"\n", + " \"Make sure that the three questions are different.\\n\"\n", + " \"\\n\"\n", + " \"You will format your generation as a python dictionary, such as:\\n\"\n", + " '{\"fact_single\": <fact_single_question>, '\n", + " '\"fact_single_answer\": <fact_single_answer>, \"summary\": <summary_question>, \"summary_answer\": <summary_answer>, \"reasoning\": <reasoning_question>, \"reasoning_answer\": <reasoning_answer>}\\n'\n", + " \"\\n\"\n", + " \"Only provide the python dictionary as your output. Make sure you provide an answer for each question.\\n\"\n", + " \"{customization_str}\"\n", + " \"\\n\"\n", + " \"Context: {context_str}\"\n", + ")\n", + "\n", + "custom_combined_question_qa_prompt: QaPromptTemplate = QaPromptTemplate(\n", + " template=CUSTOM_COMBINED_QUESTION_PROMPT,\n", + " keys=[\"context_str\", \"customization_str\"],\n", + " labels=[\"fact_single\", \"summary\", \"reasoning\"],\n", + " type_=\"question\",\n", + ")\n", + "\n", + "generate_options.prompts = [custom_combined_question_qa_prompt]" + ] + }, + { + "cell_type": "markdown", + "id": "919199c0-3747-409a-85ab-0155ef3ebe9d", + "metadata": {}, + "source": [ + "### Configure subset selection" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1197d4e-8354-45e3-9ec9-85c78ba36548", + "metadata": {}, + "outputs": [], + "source": [ + "NUM_CHUNKS_TO_SELECT_FOR_AUTHORING = 5" + ] + }, + { + "cell_type": "markdown", + "id": "d2421d07-3e6c-4355-95f4-da8e157557c7", + "metadata": {}, + "source": [ + "### Run QA generation on selected chunks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e57edff5-9a13-47fb-9248-9140ae5baaca", + "metadata": {}, + "outputs": [], + "source": [ + "from utils.qna_gen import generate_seed_examples\n", + "\n", + "for contribution in contributions:\n", + " chunks_jsonl_path = contribution[\"dir\"] / CHUNKING_DIR / \"chunks.jsonl\"\n", + " authoring_path = contribution[\"dir\"] / AUTHORING_DIR\n", + "\n", + " # pass the chunk budget and endpoint settings configured above\n", + " qna_output_path = generate_seed_examples(contribution[\"name\"],\n", + " chunks_jsonl_path,\n", + " authoring_path,\n", + " contribution[\"domain\"],\n", + " contribution[\"summary\"],\n", + " NUM_CHUNKS_TO_SELECT_FOR_AUTHORING,\n", + " API_KEY,\n", + " API_URL,\n", + " MODEL_ID)\n", + " print(f\"qna.yaml saved to: {qna_output_path}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "6c574f96-5860-48b9-b4ac-01d367c7717b", + "metadata": {}, + "source": [ + "### Review and Revise Seed Examples\n", + "\n", + "A quality set of seed examples has diverse contexts and question-and-answer pairs across every seed example.
You can assess the `qna.yaml` files in your preferred text editor to ensure the quality, diversity, and style of the generated questions and answers, and modify them accordingly.\n", + "\n", + "After assessment, the `qna.yaml` files can be quickly reviewed to ensure they include the required elements and the correct number of each. It is recommended to have at least 5 seed examples. Each seed example must have 3 question-and-answer pairs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "da3ef131-e5a3-4854-b6e9-3277273a91dd", + "metadata": {}, + "outputs": [], + "source": [ + "from utils.qna_gen import review_seed_examples_file\n", + "\n", + "for contribution in contributions:\n", + " qna_path = contribution[\"dir\"] / AUTHORING_DIR / \"qna.yaml\"\n", + " review_seed_examples_file(qna_path, min_seed_examples=5, num_qa_pairs=3)" + ] + }, + { + "cell_type": "markdown", + "id": "1f101076-a50f-49ea-a83b-46eaa8b39cc4", + "metadata": {}, + "source": [ + "## Create Seed Dataset for SDG\n", + "\n", + "This step creates the seed data for SDG. This data is a JSONL file that contains a combination of the `seed_examples` in the qna.yaml and the chunks from the source document.\n", + "\n", + "Intermediate seed data files are created for each contribution with the contribution's name included in the file name. For example, in the `nfl` contribution, a file containing seed data called `seed_data-nfl.jsonl` would be created in `$WORKSPACE_DIR/nfl`. This file contains a combination of all of the chunks from the NFL source documents and the seed examples in the `qna.yaml` in `$WORKSPACE_DIR/nfl/authoring`.\n", + "\n", + "After seed data files are created for each contribution, a final `seed_data.jsonl` is created in `$WORKSPACE_DIR`. This file is a concatenation of all of the intermediate `seed_data-{contribution name}.jsonl` files and should be used as an input to SDG."
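+ , + "\n", + "Once the cells below have produced `seed_data.jsonl`, you can sanity-check it by loading it back with the `datasets` library and confirming the row count and columns. A minimal sketch:\n", + "\n", + "```python\n", + "from datasets import load_dataset\n", + "\n", + "check = load_dataset('json', data_files=str(WORKSPACE_DIR / 'seed_data.jsonl'), split='train')\n", + "print(check)     # column names and row count\n", + "print(check[0])  # first record\n", + "```"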
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e2c6e31b-e8a9-406c-b2dc-27433c8fd8ec", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -qq datasets transformers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab2c9ed2-8ba8-4959-8e01-81625b81d286", + "metadata": {}, + "outputs": [], + "source": [ + "from utils.create_seed_dataset import get_seed_dataset, safe_concatenate_datasets\n", + "\n", + "contribution_datasets = []\n", + "for contribution in contributions:\n", + " chunks_dir = contribution[\"dir\"] / CHUNKING_DIR\n", + " qna_dir = contribution[\"dir\"] / AUTHORING_DIR\n", + " seed_data = get_seed_dataset(chunks_dir, qna_dir)\n", + " # use this contribution's own directory and name, not the globals left over from earlier cells\n", + " output_path = f'{contribution[\"dir\"]}/seed_data-{contribution[\"name\"]}.jsonl'\n", + " seed_data.to_json(output_path, orient='records', lines=True)\n", + " contribution_datasets.append(seed_data)\n", + " print(f\"Intermediate results saved to: {output_path}\")\n", + "\n", + "final_seed_data = safe_concatenate_datasets(contribution_datasets)\n", + "output_path = f'{WORKSPACE_DIR}/seed_data.jsonl'\n", + "final_seed_data.to_json(output_path, orient='records', lines=True)\n", + "\n", + "print(f\"Final seed data contains {final_seed_data.data.num_rows} rows\")\n", + "print(f\"Final seed data for SDG saved to: {output_path}\")" + ] + }, + { + "cell_type": "markdown", + "id": "50ff36f4-19fc-4a27-b51a-3688e7b630e4", + "metadata": {}, + "source": [ + "### Inspect the seed data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a6936825-31c1-4b46-a1af-2fb46f50158d", + "metadata": {}, + "outputs": [], + "source": [ + "print(final_seed_data.data.table.slice(length=1))" + ] + }, + { + "cell_type": "markdown", + "id": "24a8fcdb-8035-4f30-b856-46afe9f928a1", + "metadata": {}, + "source": [ + "# Summary\n", + "\n", + "To recap, given source documents in PDF format, this notebook:\n", + "\n", + "1. Converts the documents using Docling and saves them in the Docling Document format\n", + "2. Splits the extracted text into chunks of JSON\n", + "3. Generates Q&A pairs for a subset of those chunks\n", + "4. Creates a `qna.yaml` available for inspection and revision\n", + "5. Combines the chunks and `qna.yaml` to create a `seed_data.jsonl` to use for SDG\n", + "\n", + "The next step is to use the resulting `seed_data.jsonl` for SDG, as illustrated in [this notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/instructlab/knowledge/knowledge_generation_and_mixing.ipynb)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 }