Conversation

@alimaredia
Contributor

Previously, instructlab-knowledge.ipynb only accepted one PDF file and created one qna.yaml file. This PR allows multiple .pdf files to be converted and chunked, and it supports multiple knowledge contributions, each of which can have multiple source PDF files and one generated qna.yaml.

This PR also contains minor cleanup and a change to the chunking code: instead of writing chunks to individual .txt files, all of the chunks are now written into a single JSONL file.
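For illustration, writing all chunks to a single JSONL file could look roughly like the sketch below; the field names (id, text) are assumptions, not necessarily the exact schema the notebook uses.

```python
import json
from pathlib import Path

def write_chunks_jsonl(chunks: list[str], output_path: Path) -> None:
    """Write every chunk as one JSON object per line instead of one .txt file per chunk."""
    with output_path.open("w", encoding="utf-8") as f:
        for i, text in enumerate(chunks):
            # Hypothetical record shape: an index plus the chunk text.
            f.write(json.dumps({"id": i, "text": text}) + "\n")

# Example usage:
# write_chunks_jsonl(["first chunk", "second chunk"], Path("chunks.jsonl"))
```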

@anastasds
Contributor

What's the design goal behind the new contributions idea? There are a lot of newly introduced magic strings that have come in with it.

@alimaredia alimaredia requested a review from alinaryan as a code owner June 9, 2025 12:34
Other minor cleanup to remove unused variable in
authoring code

Signed-off-by: Ali Maredia <[email protected]>
@alimaredia alimaredia force-pushed the multiple-source-and-qna-files branch 3 times, most recently from aa072ea to 9b6b4f7 on June 9, 2025 14:11
@alimaredia alimaredia force-pushed the multiple-source-and-qna-files branch from 9b6b4f7 to e83a403 on June 9, 2025 14:25
This commit adds the following:
- Extensive documentation on what a knowledge contribution is
- A check on the qna.yaml file to make sure all the fields exist
  for the file to be used in the next step
- Use of os.getenv() to get environment variables for the Q&A generation model
- A bump of the Python version in CI to 3.12
- A run of the Illuminator tool over all contributions

Signed-off-by: Ali Maredia <[email protected]>
@alimaredia alimaredia force-pushed the multiple-source-and-qna-files branch from e83a403 to 3d7d5b8 on June 9, 2025 14:26
@alimaredia
Contributor Author

@anastasds A contribution is just a grouping of one or more source documents that results in one qna.yaml. I added documentation in my latest changes that describes what contributions are.

I tried my best to minimize the number of variables users need to set, but the only net new variable should just be the directory the contribution artifacts live in.
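For illustration, reading that directory from the environment with a sensible default might look like the sketch below; the variable name CONTRIBUTIONS_DIR is an assumption, not necessarily what the notebook uses.

```python
import os
from pathlib import Path

# Hypothetical environment variable name; the notebook may use a different one.
contributions_dir = Path(os.getenv("CONTRIBUTIONS_DIR", "./contributions"))
contributions_dir.mkdir(parents=True, exist_ok=True)
```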

@Ryfernandes Ryfernandes left a comment

Everything ran correctly, and the documentation was clear enough for me to follow the contributions system. A couple of thoughts, although I'm not sure if these should be addressed here or as separate features in separate PRs:

  • Since we are using os.getenv() for the authoring section, it could be helpful to provide and reference an example.env file
  • I have mentioned this before/have built it into my UI, but when users upload multiple files, it would be useful to have multiple pipelines with conversion settings for Docling and to be able to assign them to different documents. For example, here the research paper could benefit from image classification/description and formula enrichment, whereas these are largely unnecessary/a waste of compute for the NFL rulebook. I have a notebook here of a tool I wrote that starts experimenting with some methods of doing this (just specifying conversion settings and adding aliases, not assigning them yet); a rough sketch of per-document pipeline options follows after this list.
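A minimal sketch of what per-document pipelines could look like with docling's pipeline options; the file names and the specific options enabled here are assumptions, and the exact option names should be checked against the docling version in use.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Heavier pipeline for documents like the research paper:
# picture classification and formula enrichment enabled.
enriched_opts = PdfPipelineOptions()
enriched_opts.do_formula_enrichment = True
enriched_opts.do_picture_classification = True

# Lightweight pipeline for documents like the NFL rulebook: defaults only.
plain_opts = PdfPipelineOptions()

# Hypothetical per-document assignment; the notebook would need a mapping
# from each source PDF to the converter (or options) it should use.
converters = {
    "research_paper.pdf": DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=enriched_opts)}
    ),
    "nfl_rulebook.pdf": DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=plain_opts)}
    ),
}
```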

@iamemilio
Contributor

iamemilio commented Jun 9, 2025

I agree with Ryan about wanting different conversion options based on document layout and composition.

I am starting to wonder if the instructlab-knowledge notebook is doing too much. It seems like what we want to do is write a data conversion pipeline tool that wraps docling. If that's what we are delivering, would it be better off as a product, with standards, testing, explicit inputs and outputs, an SDK/CLI/UI, etc., rather than an increasingly complex script that you run in a notebook?

@khaledsulayman
Member

I am starting to wonder if the instructlab-knowledge notebook is doing too much, and I am not really sure why we are forcing it to be in a notebook, when it seems like what we want to do is write a data conversion pipeline tool that wraps docling. If that's what we are delivering, wouldn't it be better off as a product, with standards, testing, explicit inputs and outputs, an SDK/CLI/UI, etc., rather than an increasingly complex script that you run in a notebook?

@iamemilio I see where you're coming from; however, I would lean against drawing boxes around any particular part of the workflow at the moment. Modularization is definitely the ultimate goal for a product, but from my understanding the initial push towards notebooks was so they could serve as a bucket of code where we could get an unbiased sense of all the customizability we would want. I think it makes sense to build this up to include all the customizability we would want and then break it down into more concrete modules once product requirements are clearer.

"# Summary\n",
"\n",
"To recap, given a source document in PDF format, this notebook:\n",
"To recap, given a source documents in PDF format, this notebook:\n",

Are you thinking of a specific tense for this? Because before, the tenses varied a lot and were inconsistent. It doesn't matter to me which one; it should all just be in the same tense.

Contributor Author

What tense would you suggest we stick with? Maybe we do a follow-up PR aligning all of the tenses in all of the documentation within each section?

CC: @JustinXHale

   uses: actions/setup-python@v5
   with:
-    python-version: '3.11'
+    python-version: '3.12'
Contributor

Just FYI, I successfully ran it end-to-end on 3.11 with just one minor adjustment. On notebooks/instructlab-knowledge/instructlab-knowledge.ipynb#232, use contribution['name'] (single quotes).

Contributor Author

I made that adjustment, but I was hoping to use this PR as an opportunity to bump the Python version for the notebook entirely. I'm not sure how high or low a user could go. If we say we're going to support Python 3.11 and 3.12, then we should have a CI job for each one.

@fabianofranz
Contributor

Not only related to this PR, but I realized that the sample documents we provide don't generate the correct 3 question/answer pairs. While that's non-deterministic, should we switch to samples that have a higher chance of generating the pairs correctly, so the notebook runs cleanly end-to-end without the need for user review/intervention? /cc @alimaredia

Signed-off-by: Ali Maredia <[email protected]>
Docling-sdg routinely generated fewer than 5 seed
examples, so the number is being set higher.

The notebook and documentation have been reworded to
prescribe a minimum number of seed examples to check
against.

Signed-off-by: Ali Maredia <[email protected]>
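A minimal sketch of the kind of minimum-count check described in the commit above; the threshold of 5 and the variable names are illustrative assumptions, not the notebook's actual values.

```python
MIN_SEED_EXAMPLES = 5  # illustrative threshold, not necessarily the notebook's value

def check_seed_examples(seed_examples: list[dict]) -> None:
    """Warn the user when docling-sdg produced fewer seed examples than expected."""
    if len(seed_examples) < MIN_SEED_EXAMPLES:
        print(
            f"Only {len(seed_examples)} seed examples were generated; "
            f"at least {MIN_SEED_EXAMPLES} are recommended. "
            "Consider re-running generation or editing the qna.yaml by hand."
        )
```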
@alimaredia
Contributor Author

Not only related to this PR, but I realized that the sample documents we provide don't generate the correct 3 question/answer pairs. While that's non-deterministic, should we switch to samples that have a higher chance of generating the pairs correctly, so the notebook runs cleanly end-to-end without the need for user review/intervention? /cc @alimaredia

That's a good call. I actually had the idea of removing the inference-time scaling contribution (which was the one I found throwing more errors) and using FIFA soccer rules so that the example contributions are closer in theme and content.

One other thing to note is that the NFL documents were greatly slimmed down so that conversions in our CI take less time. I think adding pages back into those documents might improve how often the proper number of Q&A pairs is generated correctly.

Since this PR already has so many changes, could I address those in a follow-up?

@alimaredia
Contributor Author

Everything ran correctly, and the documentation was clear enough for me to follow the contributions system. A couple of thoughts, although I'm not sure if these should be addressed here or as separate features in separate PRs:

* Since we are using os.getenv() for the authoring section, it could be helpful to provide and reference an example.env file

* I have mentioned this before/have built it into my UI, but when users upload multiple files, it would be useful to have multiple pipelines with conversion settings for Docling and to be able to assign them to different documents. For example here, the research paper could benefit from image classification/description and formula enrichment, whereas these are largely unnecessary/a waste of compute for the NFL rulebook. I have a [notebook here](https://github.com/Ryfernandes/docling-table-omission) of a tool I wrote that starts experimenting with some methods of doing this (just specifying conversion settings and adding aliases, not assigning them yet).

@Ryfernandes Emilio had suggested having a data structure for contributions. That seems like a prerequisite to being able to call into multiple pipelines. I mentioned in a previous comment that this should be follow-up work after this PR is merged.

@alimaredia
Contributor Author

@iamemilio That's a really good point about the scope of the notebook. I think splitting this notebook back out into 3 more well-defined notebooks is going to be the way we'll go in the future, but I'd like to get that feedback from a few new users first.

@alimaredia
Contributor Author

alimaredia commented Jun 11, 2025

@iamemilio @alinaryan @kelbrown20 @fabianofranz @Ryfernandes I've addressed most of your comments; thanks for the feedback. To avoid increasing the scope of this PR further, there are 3 areas of work I'm going to open upstream issues for after this PR is merged:

  1. Create a Contribution data structure so that all of the state of each contribution can be stored in one place (a rough sketch follows below this list)
  2. Ensure that the example documents allow users to run through the notebook and more consistently generate a proper qna.yaml
  3. Do a docs review and ensure that there's consistency across the entire notebook
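For item 1, a Contribution data structure could look something like the sketch below; the field names and paths are assumptions, not a settled design.

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Contribution:
    """Hypothetical container for all of the state of one knowledge contribution."""
    name: str
    source_documents: list[Path] = field(default_factory=list)  # one or more source PDFs
    chunks_path: Path | None = None  # e.g. <contribution dir>/chunks/chunks.jsonl
    qna_path: Path | None = None     # e.g. <contribution dir>/qna.yaml
```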

What do you think?

This commit adds constants for all of the
subdirectories within a contribution.

It also changes the input of the qna.yaml
reviewer to set where the chunks.jsonl comes
from and where the output should go.

Signed-off-by: Ali Maredia <[email protected]>
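The constants described in the commit above might look roughly like this sketch; the subdirectory names are assumptions based on the commit description, not the actual values in the notebook.

```python
from pathlib import Path

# Hypothetical subdirectory names within a single contribution directory.
DOCS_DIR = "docs"      # source PDFs
CHUNKS_DIR = "chunks"  # chunks.jsonl output
QNA_DIR = "qna"        # generated qna.yaml

def contribution_paths(contribution_dir: Path) -> dict[str, Path]:
    """Resolve the standard subdirectories for one contribution."""
    return {name: contribution_dir / name for name in (DOCS_DIR, CHUNKS_DIR, QNA_DIR)}
```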
@fabianofranz
Contributor

@alimaredia 👍 sounds good to me.

@fabianofranz fabianofranz self-requested a review June 11, 2025 20:18
@Ryfernandes

Yeah that sounds great

Member

@khaledsulayman khaledsulayman left a comment

Left 2 minor comments. Otherwise, everything looks functionally good. Thanks!

Function is not called anywhere else.

Signed-off-by: Ali Maredia <[email protected]>
Break initial setup into 2 sections:
- Contribution overview, which has all documentation
  about contributions
- Getting started, which has 2 cells where key
  variables are initialized.

Signed-off-by: Ali Maredia <[email protected]>
@alimaredia
Contributor Author

@khaledsulayman Thanks for the comments! The function you pointed out was deleted since it's not being used anymore, and I made some adjustments to the sections in the notebook and the documentation to make it easier to navigate back to the first two cells to re-run them. I was constantly doing that too.

Contributor

@iamemilio iamemilio left a comment

+1 to follow up tasks. LGTM

@fabianofranz fabianofranz merged commit 32e1661 into instructlab:main Jun 12, 2025
1 check passed