Conversation

@alimaredia
Contributor

Previously, instructlab-knowledge.ipynb only accepted one PDF file and created one qna.yaml file. This PR allows multiple .pdf files to be converted and chunked, and it supports multiple knowledge contributions, each of which can have multiple source PDF files and one generated qna.yaml.

This PR also contains minor cleanup and a change to the chunking code: instead of writing chunks to individual .txt files, all of the chunks are now written into a single JSONL file.
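For illustration, writing all chunks to a single JSONL file could look roughly like the sketch below; the field names (id, text) are assumptions, not necessarily the exact schema the notebook uses.

```python
import json
from pathlib import Path

def write_chunks_jsonl(chunks: list[str], output_path: Path) -> None:
    """Write every chunk as one JSON object per line instead of one .txt file per chunk."""
    with output_path.open("w", encoding="utf-8") as f:
        for i, text in enumerate(chunks):
            # Hypothetical record shape: an index plus the chunk text.
            f.write(json.dumps({"id": i, "text": text}) + "\n")

# Example usage:
# write_chunks_jsonl(["first chunk", "second chunk"], Path("chunks.jsonl"))
```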

@anastasds
Contributor

What's the design goal behind the new contributions idea? There are a lot of newly introduced magic strings that have come in with it.

@alimaredia alimaredia requested a review from alinaryan as a code owner June 9, 2025 12:34
Other minor cleanup to remove unused variable in
authoring code

Signed-off-by: Ali Maredia <[email protected]>
@alimaredia alimaredia force-pushed the multiple-source-and-qna-files branch 3 times, most recently from aa072ea to 9b6b4f7 on June 9, 2025 14:11
@alimaredia alimaredia force-pushed the multiple-source-and-qna-files branch from 9b6b4f7 to e83a403 on June 9, 2025 14:25
This commit adds the following:
- Extensive documentation on what a knowledge contribution is
- A check on the qna.yaml file to make sure all the fields exist
  for the file to be used in the next step
- Use of os.getenv() to get environment variables for the Q&A generation model
- A bump of the Python version in CI to 3.12
- A run of the Illuminator tool over all contributions

Signed-off-by: Ali Maredia <[email protected]>
@alimaredia alimaredia force-pushed the multiple-source-and-qna-files branch from e83a403 to 3d7d5b8 on June 9, 2025 14:26
@alimaredia
Contributor Author

@anastasds A contribution is just a grouping of one or more source documents that results in one qna.yaml. I added documentation in my latest changes that describes what contributions are.

I tried my best to minimize the number of variables users need to set, but the only net new variable should just be the directory the contribution artifacts live in.
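For illustration, reading that directory from the environment with a sensible default might look like the sketch below; the variable name CONTRIBUTIONS_DIR is an assumption, not necessarily what the notebook uses.

```python
import os
from pathlib import Path

# Hypothetical environment variable name; the notebook may use a different one.
contributions_dir = Path(os.getenv("CONTRIBUTIONS_DIR", "./contributions"))
contributions_dir.mkdir(parents=True, exist_ok=True)
```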

@Ryfernandes Ryfernandes left a comment

Everything ran correctly, and the documentation was clear enough for me to follow the contributions system. A couple of thoughts, although I'm not sure if these should be addressed here or as separate features in separate PRs:

  • Since we are using os.getenv() for the authoring section, it could be helpful to provide and reference an example.env file
  • I have mentioned this before/have built it into my UI, but when users upload multiple files, it would be useful to have multiple pipelines with conversion settings for Docling and to be able to assign them to different documents. For example, here the research paper could benefit from image classification/description and formula enrichment, whereas these are largely unnecessary/a waste of compute for the NFL rulebook. I have a notebook here of a tool I wrote that starts experimenting with some methods of doing this (just specifying conversion settings and adding aliases, not assigning them yet); a rough sketch of per-document pipeline options follows after this list.
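A minimal sketch of what per-document pipelines could look like with docling's pipeline options; the file names and the specific options enabled here are assumptions, and the exact option names should be checked against the docling version in use.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Heavier pipeline for documents like the research paper:
# picture classification and formula enrichment enabled.
enriched_opts = PdfPipelineOptions()
enriched_opts.do_formula_enrichment = True
enriched_opts.do_picture_classification = True

# Lightweight pipeline for documents like the NFL rulebook: defaults only.
plain_opts = PdfPipelineOptions()

# Hypothetical per-document assignment; the notebook would need a mapping
# from each source PDF to the converter (or options) it should use.
converters = {
    "research_paper.pdf": DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=enriched_opts)}
    ),
    "nfl_rulebook.pdf": DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=plain_opts)}
    ),
}
```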

@iamemilio
Contributor

iamemilio commented Jun 9, 2025

I agree with Ryan about wanting different conversion options based on document layout and composition.

I am starting to wonder if the instructlab-knowledge notebook is doing too much. It seems like what we want to do is write a data conversion pipeline tool that wraps docling. If that's what we are delivering, would it be better off as a product, with standards, testing, explicit inputs and outputs, an SDK/CLI/UI, etc., rather than an increasingly complex script that you run in a notebook?

@khaledsulayman
Member

I am starting to wonder if the instructlab-knowledge notebook is doing too much, and I am not really sure why we are forcing it to be in a notebook, when it seems like what we want to do is write a data conversion pipeline tool that wraps docling. If that's what we are delivering, wouldn't it be better off as a product, with standards, testing, explicit inputs and outputs, an SDK/CLI/UI, etc., rather than an increasingly complex script that you run in a notebook?

@iamemilio I see where you're coming from; however, I would lean against drawing boxes around any particular part of the workflow at the moment. Modularization is definitely the ultimate goal for a product, but from my understanding the initial push towards notebooks was so they could serve as a bucket of code where we could get an unbiased sense of all the customizability we would want. I think it makes sense to build this up to include all the customizability we would want and then break it down into more concrete modules once product requirements are clearer.

"# Summary\n",
"\n",
"To recap, given a source document in PDF format, this notebook:\n",
"To recap, given a source documents in PDF format, this notebook:\n",

Are you thinking of a specific tense for this? Because before, the tenses varied a lot and were inconsistent. It doesn't matter to me which one; it should all just be in the same tense.

Contributor Author

What tense would you suggest we stick with? Maybe we do a follow-up PR aligning all of the tenses in all of the documentation within each section?

CC: @JustinXHale

   uses: actions/setup-python@v5
   with:
-    python-version: '3.11'
+    python-version: '3.12'
Contributor

Just FYI, I successfully ran it end-to-end on 3.11 with just one minor adjustment. On notebooks/instructlab-knowledge/instructlab-knowledge.ipynb#232, use contribution['name'] (single quotes).

Contributor Author

I made that adjustment, but I was hoping to use this PR as an opportunity to bump the Python version for the notebook entirely. I'm not sure how high or low a user could go. If we say we're going to support Python 3.11 and 3.12, then we should have a CI job for each one.

@fabianofranz
Contributor

Not only related to this PR, but I realized that the sample documents we provide don't generate the correct 3 question/answer pairs. While that's non-deterministic, should we switch to samples that have a higher chance of generating the pairs correctly, so the notebook runs cleanly end-to-end without the need for user review/intervention? /cc @alimaredia

Signed-off-by: Ali Maredia <[email protected]>
Docling-sdg routinely generated fewer than 5 seed
examples, so the number is being set higher.

The notebook and documentation have been reworded to
prescribe a minimum number of seed examples to check
against.

Signed-off-by: Ali Maredia <[email protected]>
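A minimal sketch of the kind of minimum-count check described in the commit above; the threshold of 5 and the variable names are illustrative assumptions, not the notebook's actual values.

```python
MIN_SEED_EXAMPLES = 5  # illustrative threshold, not necessarily the notebook's value

def check_seed_examples(seed_examples: list[dict]) -> None:
    """Warn the user when docling-sdg produced fewer seed examples than expected."""
    if len(seed_examples) < MIN_SEED_EXAMPLES:
        print(
            f"Only {len(seed_examples)} seed examples were generated; "
            f"at least {MIN_SEED_EXAMPLES} are recommended. "
            "Consider re-running generation or editing the qna.yaml by hand."
        )
```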
@alimaredia
Contributor Author

Not only related to this PR, but I realized that the sample documents we provide don't generate the correct 3 question/answer pairs. While that's non-deterministic, should we switch to samples that have a higher chance of generating the pairs correctly, so the notebook runs cleanly end-to-end without the need for user review/intervention? /cc @alimaredia

That's a good call. I actually had the idea of removing the inference-time scaling contribution (which was the one I found throwing more errors) and using FIFA soccer rules so that the example contributions are closer in theme and content.

One other thing to note is that the NFL documents were greatly slimmed down so that conversions in our CI take less time. I think adding pages back into those documents might improve how often the proper number of Q&A pairs is generated correctly.

Since this PR already has so many changes, could I address those in a follow-up?

@alimaredia
Contributor Author

Everything ran correctly, and the documentation was clear enough for me to follow the contributions system. A couple of thoughts, although I'm not sure if these should be addressed here or as separate features in separate PRs:

* Since we are using os.getenv() for the authoring section, it could be helpful to provide and reference an example.env file

* I have mentioned this before/have built it into my UI, but when users upload multiple files, it would be useful to have multiple pipelines with conversion settings for Docling and to be able to assign them to different documents. For example here, the research paper could benefit from image classification/description and formula enrichment, whereas these are largely unnecessary/a waste of compute for the NFL rulebook. I have a [notebook here](https://github.com/Ryfernandes/docling-table-omission) of a tool I wrote that starts experimenting with some methods of doing this (just specifying conversion settings and adding aliases, not assigning them yet).

@Ryfernandes Emilio had suggested having a data structure for contributions. That seems like a prerequisite to being able to call into multiple pipelines. I mentioned in a previous comment that this should be follow-up work after this PR is merged.

@alimaredia
Contributor Author

@iamemilio That's a really good point about the scope of the notebook. I think splitting this notebook back out into 3 more well-defined notebooks is going to be the way we'll go in the future, but I'd like to get that feedback from a few new users first.

@alimaredia
Contributor Author

alimaredia commented Jun 11, 2025

@iamemilio @alinaryan @kelbrown20 @fabianofranz @Ryfernandes I've addressed most of your comments; thanks for the feedback. To avoid increasing the scope of this PR further, there are 3 areas of work I'm going to open upstream issues for after this PR is merged:

  1. Create a Contribution data structure so that all of the state of each contribution can be stored in one place (a rough sketch follows below this list)
  2. Ensure that the example documents allow users to run through the notebook and more consistently generate a proper qna.yaml
  3. Do a docs review and ensure that there's consistency across the entire notebook
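For item 1, a Contribution data structure could look something like the sketch below; the field names and paths are assumptions, not a settled design.

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Contribution:
    """Hypothetical container for all of the state of one knowledge contribution."""
    name: str
    source_documents: list[Path] = field(default_factory=list)  # one or more source PDFs
    chunks_path: Path | None = None  # e.g. <contribution dir>/chunks/chunks.jsonl
    qna_path: Path | None = None     # e.g. <contribution dir>/qna.yaml
```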

What do you think?

This commit adds constants for all of the
subdirectories within a contribution.

It also changes the input of the qna.yaml
reviewer to set where the chunks.jsonl comes
from and where the output should go.

Signed-off-by: Ali Maredia <[email protected]>
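The constants described in the commit above might look roughly like this sketch; the subdirectory names are assumptions based on the commit description, not the actual values in the notebook.

```python
from pathlib import Path

# Hypothetical subdirectory names within a single contribution directory.
DOCS_DIR = "docs"      # source PDFs
CHUNKS_DIR = "chunks"  # chunks.jsonl output
QNA_DIR = "qna"        # generated qna.yaml

def contribution_paths(contribution_dir: Path) -> dict[str, Path]:
    """Resolve the standard subdirectories for one contribution."""
    return {name: contribution_dir / name for name in (DOCS_DIR, CHUNKS_DIR, QNA_DIR)}
```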
@fabianofranz
Contributor

@alimaredia 👍 sounds good to me.

@fabianofranz fabianofranz self-requested a review June 11, 2025 20:18
@Ryfernandes

Yeah that sounds great

Member

@khaledsulayman khaledsulayman left a comment

Left 2 minor comments. Otherwise, everything looks functionally good. Thanks!

Function is not called anywhere else.

Signed-off-by: Ali Maredia <[email protected]>
Break initial setup into 2 sections:
- Contribution overview, which has all documentation
  about contributions
- Getting started, which has 2 cells where key
  variables are initialized.

Signed-off-by: Ali Maredia <[email protected]>
@alimaredia
Contributor Author

@khaledsulayman Thanks for the comments! The function you pointed out was deleted since it's not being used anymore, and I made some adjustments to the sections in the notebook and the documentation to make it easier to navigate back to the first two cells to re-run them. I was constantly doing that too.

Contributor

@iamemilio iamemilio left a comment

+1 to follow up tasks. LGTM

@fabianofranz fabianofranz merged commit 32e1661 into instructlab:main Jun 12, 2025
1 check passed