@jlee600 jlee600 commented Jan 30, 2026

  1. Introduces a Makefile target and documentation for extracting PDFs in data/chapters/ to a unified markdown file (data/book_with_pages.md).

  2. Updates extraction.py to process exactly one PDF and output to the new markdown file, and updates main.py to use this file for indexing. README instructions are revised to reflect the new extraction step.

  3. No more textbook title and output name hardcoding.

# Ensure exactly one PDF is found
if len(pdfs) == 0:
    print("ERROR: No PDFs found in data/chapters/. Please copy a PDF there first.", file=sys.stderr)
Contributor

Please change this to accept multiple PDFs. I know this may break other parts of the pipeline, but I'll fix those afterwards. Basically, run the extraction for each PDF you find and store each result in the "data" folder using this naming convention: "<input_file_name_without_the_.pdf>--extracted_markdown.md".

So if you have 2 files, "chapter1.pdf" and "blah2.pdf", you will end up with 2 Markdown files in ./data named "chapter1--extracted_markdown.md" and "blah2--extracted_markdown.md".
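
For reference, the naming convention requested above could be sketched roughly like this. The helper names `markdown_path_for` and `plan_extraction` are hypothetical and not part of the PR, and sorting the PDFs is an assumption made only to keep the order deterministic. The sketch only computes the output paths; the actual extraction call in extraction.py is left out.

```python
from pathlib import Path

# Suffix taken from the naming convention in the review comment.
SUFFIX = "--extracted_markdown.md"


def markdown_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Map e.g. chapter1.pdf to <out_dir>/chapter1--extracted_markdown.md (hypothetical helper)."""
    return out_dir / f"{pdf_path.stem}{SUFFIX}"


def plan_extraction(chapters_dir: Path, out_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every PDF in chapters_dir with its target markdown path (hypothetical helper)."""
    pdfs = sorted(chapters_dir.glob("*.pdf"))  # sorted only for a stable order
    return [(pdf, markdown_path_for(pdf, out_dir)) for pdf in pdfs]
```

Something like `plan_extraction(Path("data/chapters"), Path("data"))` would then give one (PDF, markdown) pair per input file, and the existing per-PDF extraction could be run in a loop over those pairs.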


build_index(
-    markdown_file="data/silberschatz.md",
+    markdown_file="data/book_with_pages.md",
Contributor

Similarly here. Look through all the .md files in ./data and just build the index over the first one you find. I will fix this behavior myself later; this is just so the code doesn't break after your PR is merged.
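
A minimal sketch of the interim behavior described above, assuming "first" means first in sorted filename order (the comment does not specify an order). The helper name `first_markdown_file` is hypothetical; the `build_index` signature beyond `markdown_file` is not shown in this thread, so the call is left to main.py.

```python
from pathlib import Path
from typing import Optional


def first_markdown_file(data_dir: Path) -> Optional[Path]:
    """Return the first .md file directly under data_dir, or None if there is none."""
    # Sorting is an assumption: glob order is otherwise filesystem-dependent.
    candidates = sorted(p for p in data_dir.glob("*.md") if p.is_file())
    return candidates[0] if candidates else None
```

main.py could then pass `str(first_markdown_file(Path("data")))` as `markdown_file`, after checking for `None` and exiting with an error when no markdown file exists.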


# PDF to Markdown extraction
run-extract:
	@echo "Extracting PDF to markdown (data/chapters/*.pdf -> data/book_with_pages.md)"
Contributor

Adjust the "-> data/book_with_pages.md" part of this echo statement to just say MD files in ./data.

```shell
make run-extract
```
This generates a `book_with_pages.md` under `TOKENSMITH/data/`
Contributor

Adjust this to match the previous comments.

Contributor

@shahmeer99 shahmeer99 left a comment

I have left comments on the multi-file handling behavior that need to be addressed. Please take a look and fix accordingly.

@TanglyEagle7718
@jlee600 I think it might be useful to add pytest tests for this to ensure that PDF parsing doesn't break in the future.
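
One possible shape for such tests, as a rough sketch only: `extract_all` below is a hypothetical stand-in for the multi-PDF loop (it is not the function in extraction.py), and it takes the per-PDF extractor as an argument so the tests can pass a fake one instead of parsing real PDFs. `tmp_path` is pytest's built-in temporary-directory fixture.

```python
from pathlib import Path
from typing import Callable, List


def extract_all(chapters_dir: Path, out_dir: Path,
                extract_one: Callable[[Path], str]) -> List[Path]:
    """Hypothetical stand-in: write one markdown file per PDF and return the paths."""
    written = []
    for pdf in sorted(chapters_dir.glob("*.pdf")):
        target = out_dir / f"{pdf.stem}--extracted_markdown.md"
        target.write_text(extract_one(pdf), encoding="utf-8")
        written.append(target)
    return written


def test_one_markdown_file_per_pdf(tmp_path: Path) -> None:
    chapters = tmp_path / "chapters"
    chapters.mkdir()
    # Placeholder bytes; the fake extractor never parses them.
    (chapters / "chapter1.pdf").write_bytes(b"%PDF-1.4 placeholder")
    (chapters / "blah2.pdf").write_bytes(b"%PDF-1.4 placeholder")

    written = extract_all(chapters, tmp_path, lambda pdf: f"# {pdf.stem}\n")

    assert [p.name for p in written] == [
        "blah2--extracted_markdown.md",
        "chapter1--extracted_markdown.md",
    ]
    assert written[1].read_text(encoding="utf-8") == "# chapter1\n"


def test_no_pdfs_writes_nothing(tmp_path: Path) -> None:
    chapters = tmp_path / "chapters"
    chapters.mkdir()
    assert extract_all(chapters, tmp_path, lambda pdf: "") == []
```

A test that exercises the real parser would additionally need a small sample PDF checked into the repository, which is outside what this thread shows.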

4 participants