Conversation

@alinaryan
Member

@alinaryan alinaryan commented May 5, 2025

This change adds the Illuminator tool’s core functions to the instructlab-knowledge notebook for analyzing a converted document and summarizing merged table cell issues for each table.

It also refactors the Illuminator to accept JSON as input and adjusts some imports to be relative.

@alinaryan alinaryan force-pushed the add-illuminator-notebook branch 2 times, most recently from adb4520 to 420d143 Compare May 20, 2025 18:01


def analyze_pdf_with_docling(file_path) -> Dict[str, Union[int, List[Any], set]]:
def convert_pdf_with_docling(file_path: str) -> DoclingDocument:
Contributor

It might be nice to rename this to `convert_to_docling_document`, since it's not PDF-specific anymore. It would also be nice for this function to take an argument that controls whether the markdown is saved; I'm not sure we want that in every case.
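
A minimal sketch of what the suggested signature could look like. The `save_markdown` and `output_dir` parameters and the injected `convert` callable are hypothetical names chosen for illustration, not the PR's actual code. In the notebook, `convert` would presumably wrap whatever Docling conversion call the function already makes; the only thing assumed about the returned document is that it offers `export_to_markdown()`.

```python
from pathlib import Path
from typing import Callable, Optional, Protocol


class SupportsMarkdownExport(Protocol):
    """Anything that can render itself as markdown (assumed document interface)."""

    def export_to_markdown(self) -> str: ...


def convert_to_docling_document(
    file_path: str,
    convert: Callable[[str], SupportsMarkdownExport],
    save_markdown: bool = False,
    output_dir: Optional[Path] = None,
) -> SupportsMarkdownExport:
    """Convert a source file and optionally write its markdown next to it.

    `convert` is a hypothetical injection point for the real conversion call.
    """
    document = convert(file_path)

    if save_markdown:
        # Default to the source file's directory when no output_dir is given.
        target_dir = Path(output_dir) if output_dir is not None else Path(file_path).parent
        target_dir.mkdir(parents=True, exist_ok=True)
        markdown_path = target_dir / f"{Path(file_path).stem}.md"
        markdown_path.write_text(document.export_to_markdown(), encoding="utf-8")

    return document
```

Defaulting `save_markdown` to `False` keeps the function free of side effects unless the caller opts in, which matches the concern that saving is not wanted in every case.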

@alimaredia
Contributor

alimaredia commented May 21, 2025

@alinaryan This PR title mentions adding an illuminator notebook, but there isn't one in here. Maybe the title should change to reflect the refactoring you're doing.

Also, if you added the illuminator to the instructlab-knowledge notebook, do you have an idea of what the notebook's calls to it might look like?

@alinaryan alinaryan force-pushed the add-illuminator-notebook branch 2 times, most recently from 974086f to f6f8fc0 Compare May 21, 2025 20:43
@alinaryan
Member Author

> This PR title mentions adding an illuminator notebook, but there isn't one in here. Maybe the title should change to reflect the refactoring you're doing.
> Also, if you added the illuminator to the instructlab-knowledge notebook, do you have an idea of what the notebook's calls to it might look like?

@alimaredia
I added a commit that adds the notebook and reworded the PR description. It currently writes the illuminator results to the notebook output AND to an output file. LMK what you think about this approach.

@alinaryan alinaryan force-pushed the add-illuminator-notebook branch from f6f8fc0 to f276f14 Compare May 27, 2025 18:09
@iamemilio
Contributor

I think it's worth going through this and clearing all the outputs from the instructlab-knowledge.ipynb notebook.

@iamemilio
Contributor

I like the little emojis in the console output.

@alinaryan alinaryan force-pushed the add-illuminator-notebook branch from f276f14 to d625f13 Compare May 29, 2025 14:56
@alinaryan
Member Author

@JustinXHale LMK what you think about the UX design here

@JustinXHale
Member

JustinXHale commented Jun 3, 2025

This looks great @alinaryan!

UX Review of Data Pre-Processing: From source PDF to SDG-ready

| Guideline | Status | Suggestion |
| --- | --- | --- |
| Header | | Clear and concise title. |
| Goal/Objective Present | | The notebook provides a brief summary of what it does and why at the beginning. The numbered list (1, 2, 3, 4) clearly outlines the main steps users will follow in the notebook. |
| Setup & Prerequisites | | Prerequisites (like !pip install commands) are placed within each relevant major section, which supports a modular, step-by-step approach. |
| Markdown Before Code | 🔹 | Most code cells are preceded by markdown cells explaining their purpose. However, the initial code cell that defines the Path variables and creates directories lacks an introductory markdown cell explaining its role (e.g., "Initialize Workspace and Define Output Paths"). The empty code cell before "Read generated QAs and restructure" also lacks a preceding markdown cell; it was probably left there by accident. The "Why" before each section could be improved. |
| Handle Errors or Outputs | | The notebook includes print statements to show expected outputs and also notes a warning regarding single file support ("***** WARNING! Only one file at a time is supported at this time."). It also provides guidance if QA generation fails to meet the required number of pairs. |
| Hardcoding Avoided | | File paths are managed using pathlib.Path and defined variables (WORKSPACE_ROOT, SOURCE_DOCUMENT_DIR, etc.). |
| Concise Cells | | Code logic is generally broken into manageable chunks, avoiding overly long or dense cells. This is really good! |
| Code Commenting | | Inline comments are present in relevant sections to clarify logic. |
| End Wrap-Up & Handoff | | The notebook concludes with a "Summary" section that recaps the completed steps and clearly outlines the next steps, including a link to a relevant follow-up notebook. |

Reviewer Notes

The notebook provides a clear and well-structured workflow for data pre-processing. Improving the initial setup instructions and adding more descriptive inline comments in key code sections could further enhance its user experience. The use of markdown headers and logical code separation is generally good. The numbered list in the introduction serves well as a table of contents, guiding the user through the notebook's flow. The "Why/goal" explanation before each section could be strengthened, similar to what is done in the chunking section.

" try:\n",
" generate_summary(results)\n",
" finally:\n",
" sys.stdout = original_stdout\n",
Contributor

@alinaryan when I ran through the notebook I wasn't seeing any output in the illuminator_readable_summary.txt. Any idea why this might be happening?

Member Author

I updated it. The full output now prints to both the summary file and the notebook cell output.
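
One way to get the same output in both places is to "tee" stdout rather than swap it out. This is only a sketch under assumptions, not the code in the PR: the `Tee` class and the `run_and_save_summary` helper are hypothetical names, and `generate_summary` is passed in as a parameter so that the example stands on its own.

```python
import contextlib
import sys
from pathlib import Path


class Tee:
    """Minimal write-only stream that forwards each write to several streams."""

    def __init__(self, *streams):
        self._streams = streams

    def write(self, text):
        for stream in self._streams:
            stream.write(text)
        return len(text)

    def flush(self):
        for stream in self._streams:
            stream.flush()


def run_and_save_summary(generate_summary, results, summary_path):
    """Run generate_summary(results), echoing its prints to stdout and a file."""
    summary_path = Path(summary_path)
    with summary_path.open("w", encoding="utf-8") as summary_file:
        # redirect_stdout restores sys.stdout on exit, even if an error is raised,
        # so no manual try/finally around sys.stdout is needed.
        with contextlib.redirect_stdout(Tee(sys.stdout, summary_file)):
            generate_summary(results)
```

In the notebook this might be called as `run_and_save_summary(generate_summary, results, "illuminator_readable_summary.txt")`. Because the file is opened in a `with` block, it is flushed and closed before the cell finishes, which avoids ending up with an empty summary file.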

@alinaryan alinaryan force-pushed the add-illuminator-notebook branch 2 times, most recently from 7b00ea8 to 1f1cd45 Compare June 4, 2025 20:58
"\n",
"***"
]
},
Contributor

Why is this removed?

Member Author

I added it back!

@alinaryan alinaryan force-pushed the add-illuminator-notebook branch 2 times, most recently from 385f1b1 to d06d067 Compare June 5, 2025 18:00
Signed-off-by: Alina Ryan <[email protected]>
@alinaryan alinaryan force-pushed the add-illuminator-notebook branch from d06d067 to 34107d9 Compare June 5, 2025 18:05
@alimaredia alimaredia merged commit d96f286 into instructlab:main Jun 5, 2025
2 checks passed
6 participants