Add PDF to Markdown extraction workflow #67

jlee600 · 2026-01-30T03:07:59Z

Introduces a Makefile target and documentation for extracting PDFs in data/chapters/ to a unified markdown file (data/book_with_pages.md).
Updates extraction.py to process exactly one PDF and output to the new markdown file, and updates main.py to use this file for indexing. README instructions are revised to reflect the new extraction step.
The textbook title and output name are no longer hardcoded.

shahmeer99 · 2026-02-01T23:08:57Z

src/preprocessing/extraction.py

+
+    # Ensure exactly one PDF is found
+    if len(pdfs) == 0:
+        print("ERROR: No PDFs found in data/chapters/. Please copy a PDF there first.", file=sys.stderr)


Please change this to accept multiple PDFs. I know this may break other things in the pipeline, but I'll fix those afterwards. Basically, run extraction for each PDF you find and store the results in the "data" folder using this naming convention: "<input_file_name_without_the_.pdf>--extracted_markdown.md".

So if you have 2 files "chapter1.pdf" and "blah2.pdf" you will have 2 mds in ./data named "chapter1--extracted_markdown.md" and "blah2--extracted_markdown.md"
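A minimal sketch of the requested per-PDF behavior, assuming the extractor can be passed in as a callable. The names `output_path_for`, `extract_all`, and `extract_one` are hypothetical and not taken from `extraction.py`; only the `--extracted_markdown.md` suffix and the directories come from the comment above.

```python
from pathlib import Path
from typing import Callable, List

# Suffix from the naming convention requested in the review comment.
SUFFIX = "--extracted_markdown.md"


def output_path_for(pdf_path: Path, data_dir: Path) -> Path:
    """Map e.g. chapters/chapter1.pdf to <data_dir>/chapter1--extracted_markdown.md."""
    return data_dir / f"{pdf_path.stem}{SUFFIX}"


def extract_all(
    chapters_dir: Path,
    data_dir: Path,
    extract_one: Callable[[Path], str],  # hypothetical: returns markdown for one PDF
) -> List[Path]:
    """Run extraction for every PDF in chapters_dir and write one .md per PDF."""
    data_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Sorted so the processing order is deterministic across platforms.
    for pdf in sorted(chapters_dir.glob("*.pdf")):
        out = output_path_for(pdf, data_dir)
        out.write_text(extract_one(pdf), encoding="utf-8")
        written.append(out)
    return written
```

With `chapter1.pdf` and `blah2.pdf` in the chapters directory, this would write `chapter1--extracted_markdown.md` and `blah2--extracted_markdown.md` into the data directory. An empty list is returned when no PDFs are found, so the caller can still print the existing error message.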

shahmeer99 · 2026-02-01T23:10:32Z

src/main.py


    build_index(
-        markdown_file="data/silberschatz.md",
+        markdown_file="data/book_with_pages.md",


Similarly here. Look over all the MD files in ./data and just build the index over the first one you find. I will fix this behavior later myself; this is just so the code doesn't break after your PR is merged.
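One way this stopgap could look, as a sketch only. `first_markdown_file` is a hypothetical helper, and sorting the matches is an assumption made so that "first" is deterministic; the comment does not specify an order.

```python
from pathlib import Path
from typing import Optional


def first_markdown_file(data_dir: Path) -> Optional[Path]:
    """Return the first .md file in data_dir (alphabetical order), or None if there is none."""
    # Assumption: alphabetical order defines "first"; glob order alone is not guaranteed.
    md_files = sorted(data_dir.glob("*.md"))
    return md_files[0] if md_files else None
```

In `main.py`, the result could then be converted with `str(...)` and passed as the `markdown_file` argument of `build_index`, after handling the `None` case.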

shahmeer99 · 2026-02-01T23:11:21Z

Makefile


+# PDF to Markdown extraction
+run-extract:
+	@echo "Extracting PDF to markdown (data/chapters/*.pdf -> data/book_with_pages.md)"


Adjust the "-> data/book_with_pages.md" part of this echo statement to just say MD files in ./data.

shahmeer99 · 2026-02-01T23:11:41Z

README.md

+```shell
+make run-extract
+```
+This generates a `book_with_pages.md` under `TOKENSMITH/data/`


Adjust this according to the previous comments.

shahmeer99

I have left comments on the multi-file handling behavior that need to be addressed. Please take a look and fix accordingly.

TanglyEagle7718 · 2026-02-01T23:17:12Z

@jlee600 I think it might be useful to add pytests for this to ensure that pdf parsing doesn't break in the future
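As a starting point, a pytest-style check of the output naming could look like the sketch below. `extracted_name` is a hypothetical stand-in for whatever naming logic ends up in `extraction.py`; a real test would import that logic rather than redefine it, and would also need a small sample PDF to cover the parsing itself.

```python
from pathlib import Path


def extracted_name(pdf_name: str) -> str:
    """Hypothetical stand-in: derive the markdown file name for a given PDF name."""
    return f"{Path(pdf_name).stem}--extracted_markdown.md"


def test_extracted_name_follows_convention():
    # Examples taken from the review comment on extraction.py.
    assert extracted_name("chapter1.pdf") == "chapter1--extracted_markdown.md"
    assert extracted_name("blah2.pdf") == "blah2--extracted_markdown.md"


def test_extracted_name_keeps_inner_dots():
    # Only the final ".pdf" extension is dropped.
    assert extracted_name("my.book.v2.pdf") == "my.book.v2--extracted_markdown.md"
```

pytest collects plain `test_*` functions with bare `assert` statements, so no extra imports are needed for this sketch.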

jarulraj requested a review from TanglyEagle7718 January 30, 2026 16:45

shahmeer99 requested review from RajShah-1 and shahmeer99 February 1, 2026 23:04

shahmeer99 reviewed Feb 1, 2026

shahmeer99 requested changes Feb 1, 2026
