Add PDF to Markdown extraction workflow #67
base: main
Introduces a Makefile target and documentation for extracting PDFs in data/chapters/ to a unified markdown file (data/book_with_pages.md). Updates extraction.py to process exactly one PDF and output to the new markdown file, and updates main.py to use this file for indexing. README instructions are revised to reflect the new extraction step.
| # Ensure exactly one PDF is found | ||
| if len(pdfs) == 0: | ||
| print("ERROR: No PDFs found in data/chapters/. Please copy a PDF there first.", file=sys.stderr) |
Please change this to accept multiple PDFs. I know this may break other things in the pipeline, but I'll fix those afterwards. Basically, run the extraction for each PDF you find and store the results in the "data" folder with this naming convention: "<input_file_name_without_the_.pdf>--extracted_markdown.md".
So if you have 2 files, "chapter1.pdf" and "blah2.pdf", you will have 2 MD files in ./data named "chapter1--extracted_markdown.md" and "blah2--extracted_markdown.md".
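The naming convention requested above could be sketched roughly as follows. This is only an illustration, not the PR's code: `output_path_for`, `extract_all`, and the `extract_one` callback are hypothetical names, and the `data/chapters` input directory is taken from the error message in the diff.

```python
from pathlib import Path
from typing import Callable, List


def output_path_for(pdf_path: Path, out_dir: Path = Path("data")) -> Path:
    """Map e.g. chapters/chapter1.pdf to data/chapter1--extracted_markdown.md."""
    return out_dir / f"{pdf_path.stem}--extracted_markdown.md"


def extract_all(
    extract_one: Callable[[Path], str],
    pdf_dir: Path = Path("data/chapters"),
    out_dir: Path = Path("data"),
) -> List[Path]:
    """Run extract_one on every PDF in pdf_dir, writing one markdown file per PDF.

    extract_one is a hypothetical stand-in for whatever extraction.py
    actually uses to turn a single PDF into markdown text.
    """
    pdfs = sorted(pdf_dir.glob("*.pdf"))
    if not pdfs:
        raise FileNotFoundError(f"No PDFs found in {pdf_dir}")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in pdfs:
        target = output_path_for(pdf, out_dir)
        target.write_text(extract_one(pdf), encoding="utf-8")
        written.append(target)
    return written
```

With "chapter1.pdf" and "blah2.pdf" in the input directory, this would write "blah2--extracted_markdown.md" and "chapter1--extracted_markdown.md" into the output directory, in sorted order.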
| build_index( | ||
| markdown_file="data/silberschatz.md", | ||
| markdown_file="data/book_with_pages.md", |
Similarly here. Look over all the MD files in ./data and just build the index over the first one you find. I will fix this behavior later myself; this is just so the code doesn't break after your PR is merged.
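One possible way to pick that file is sketched below. The helper name `first_markdown_file` is hypothetical, and reading "first" as "first in sorted filename order" is an assumption made here so the choice stays stable across runs.

```python
from pathlib import Path
from typing import Optional


def first_markdown_file(data_dir: Path = Path("data")) -> Optional[Path]:
    """Return the first .md file directly inside data_dir, or None if there is none.

    "First" is taken to mean first in sorted filename order (an assumption),
    so the same file is chosen on every run.
    """
    candidates = sorted(p for p in data_dir.glob("*.md") if p.is_file())
    return candidates[0] if candidates else None
```

The returned path could then be passed as the `markdown_file` argument of `build_index` in place of the hardcoded string, with the `None` case handled by whatever error reporting the caller prefers.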
| # PDF to Markdown extraction | ||
| run-extract: | ||
| @echo "Extracting PDF to markdown (data/chapters/*.pdf -> data/book_with_pages.md)" |
Adjust the "-> data/book_with_pages.md" part of this echo statement to just say MD files in ./data.
| ```shell | ||
| make run-extract | ||
| ``` | ||
| This generates a `book_with_pages.md` under `TOKENSMITH/data/` |
Adjust this according to the previous comments.
shahmeer99
left a comment
I have left comments regarding multi-file handling behavior that need to be addressed. Please take a look and fix accordingly.
@jlee600 I think it might be useful to add pytests for this to ensure that PDF parsing doesn't break in the future.
No more textbook title and output name hardcoding.