-
Notifications
You must be signed in to change notification settings - Fork 11
Add ability for instructlab-knowledge notebook to take multiple source and qna files
#18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability for instructlab-knowledge notebook to take multiple source and qna files
#18
Conversation
|
What's the design goal behind the new |
Signed-off-by: Ali Maredia <[email protected]>
Other minor cleanup to remove unused variable in authoring code Signed-off-by: Ali Maredia <[email protected]>
Signed-off-by: Ali Maredia <[email protected]>
aa072ea to
9b6b4f7
Compare
Signed-off-by: Ali Maredia <[email protected]>
Signed-off-by: Ali Maredia <[email protected]>
Signed-off-by: Khaled Sulayman <[email protected]>
…l authoring Signed-off-by: Khaled Sulayman <[email protected]>
Signed-off-by: Ali Maredia <[email protected]>
9b6b4f7 to
e83a403
Compare
This commit adds the following:: - extensive documentation on what a knowledge contribution is - a check on the qna.yaml file to make sure all the fields exist for the file to be used in the next step. - Use os.getenv() to get environment variables for q&a generation model - Bumps python version in CI to 3.12 - Runs the Illuminator tool over all contributions Signed-off-by: Ali Maredia <[email protected]>
e83a403 to
3d7d5b8
Compare
|
@anastasds A contribution is just groupings of one or more source documents that results in the qna.yaml. I added documentation in my latest changes that describe what contributions are. I tried my best to minimize the number of variables users need to set, but the only net new variable should just be the directory the contribution artifacts live in. |
Ryfernandes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything ran correctly, and the documentation was clear enough for me to follow the contributions system. A couple of thoughts, although not sure if these should be addressed here or as separate features in separate pr's:
- Since we are using os.environ() for the authoring section, it could be helpful to provide and reference an example.env file
- I have mentioned this before/have built it into my UI, but when users upload multiple files, it would be useful to have multiple pipelines with conversion settings for Docling and to be able to assign them to different documents. For example here, the research paper could benefit from image classification/description and formula enrichment, whereas these are largely unnecessary/a waste of compute for the NFL rulebook. I have a notebook here of a tool I wrote that starts experimenting with some methods of doing this (just specifying conversion settings and adding aliases, not assigning them yet).
|
I agree with Ryan about wanting different conversion options based on document layout and composition. I am starting to wonder if the |
@iamemilio I see where you're coming from, however I would lean against drawing boxes around any particular part of the workflow at the moment. Modularization is definitely the ultimate goal for a product, but from my understanding the initial push towards notebooks was so they could serve as a bucket of code where we could get an unbiased sense of all the customizability we would want. I think it makes sense to build this up to include all the customizability we would want and then break it down into more concrete modules once product requirements are clearer. |
| "# Summary\n", | ||
| "\n", | ||
| "To recap, given a source document in PDF format, this notebook:\n", | ||
| "To recap, given a source documents in PDF format, this notebook:\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you thinking of a specific tense for this, because before the tenses varied a lot and were inconsistent. It does matter to me which one, It should all just be in the same tense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you suggest the tense we should stick with? Maybe we do a follow up PR aligning all of the tenses in all of the documentation within each section?
CC: @JustinXHale
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.11' | ||
| python-version: '3.12' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI, I successfully ran it end-to-end on 3.11 with just one minor adjustment. On notebooks/instructlab-knowledge/instructlab-knowledge.ipynb#232 , use contribution['name'] (single quotes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made that adjustment but I was hoping to use this PR as an opportunity to bump the python version for the notebook entirely. I'm not sure how high or low a user could go. If we say we're going to support python 3.11 and 3.12 then we should have a CI job for each one.
Signed-off-by: Ali Maredia <[email protected]>
|
Not only related to this PR, but I realized that the sample documents we provide don't generate the correct 3 answer/question pairs. While that's non-deterministic, should we switch to samples that have higher chance of generating the pairs correctly, so the notebook runs cleanly end-to-end without the need of user review/intervention? /cc @alimaredia |
Signed-off-by: Ali Maredia <[email protected]>
Signed-off-by: Ali Maredia <[email protected]>
Docling-sdg routinely generated less than 5 seed examples so the number is being set higher. The notebook and documentation has been reworded to prescribe a minumum number of seed examples to check againist. Signed-off-by: Ali Maredia <[email protected]>
That's a good call. Actually had the idea of removing the Inference time scaling contribution (which was the one I found throwing more errors) and using Fifa soccer rules to have the examples contributions closer in theme and contents. One other thing to note is that the NFL documents were greatly slimmed down for conversions in our CI to take less time. I think adding pages back into those documents might improve how often the proper number of q&a pairs are generate correctly. Since this PR has so many changes could I add a follow up with those changes? |
@Ryfernandes Emilio had suggested having a data structure for contributions. That seems like it would be a pre-requisite to being able to call into multiple pipelines. I mentioned in a previous comment this should be follow up work after this PR is merged. |
|
@iamemilio That's a really good point about the scope of the notebook. I think splitting this notebook back out into 3 more well defined notebooks is going to be the way we'll go in the future, but I'd like to get that feedback from a few new users first. |
|
@iamemilio @alinaryan @kelbrown20 @fabianofranz @Ryfernandes I've addressed most of your comments thanks for the feedback. To not keep increasing the scope of this PR there's 3 areas of work I'm going to open upstream issues on after this PR is merged:
What do you think? |
This commit adds constants for all of the subdirectories within a contribution. It also changes the input of the qna.yaml reviewer to set where the chunks.jsonl is coming from and where the output should be. Signed-off-by: Ali Maredia <[email protected]>
|
@alimaredia 👍 sound good to me. |
|
Yeah that sounds great |
khaledsulayman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left 2 minor comments. Otherwise, everything looks functionally good. Thanks!
Function is not called anywhere else. Signed-off-by: Ali Maredia <[email protected]>
Break initial setup into 2 sections: - Contribution overview which has all documentation about contributions - Getting started which has 2 cells where key variables are initialized. Signed-off-by: Ali Maredia <[email protected]>
|
@khaledsulayman Thanks for the comments! The function you pointed out was deleted since it's not being used anymore and I made some adjustments to the sections in the notebook and the documentation to make it easier to navigate back to the first two cells to re-run them. I was constantly doing that too. |
iamemilio
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to follow up tasks. LGTM
instructlab-knowledge.ipynbonly accepted one PDF file and created oneqna.yamlfile. This PR allows multiple .pdf files to be converted and chunked and allows multiple knowledge contributions each of which can have multiple source pdf files and have one qna.yaml generated.This PR also contains minor cleanup also a changing to chunking code from having chunks in individual .txt files to all of the chunks being moved into a JSONL file.