Add PDF to Markdown extraction workflow #67

shahmeer99 · 2026-02-01T23:11:21Z

Adjust the "-> data/book_with_pages.md" part of this echo statement so it just refers to the MD files in ./data.

shahmeer99 · 2026-02-01T23:11:41Z

Adjust this according to the previous comments.

shahmeer99 · 2026-02-01T23:10:32Z

Similarly here. Look over all the MD files in ./data and just build the index over the first one you find. I will fix this behavior later myself; this is just so the code doesn't break after your PR is merged.
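A minimal sketch of the stopgap requested above, assuming "first" means first in sorted filename order (the comment does not specify an order; sorting is only there to make the choice deterministic). The helper name `first_markdown_file` is hypothetical and not part of the PR.

```python
from pathlib import Path


def first_markdown_file(data_dir: Path = Path("data")) -> Path:
    """Return the first .md file directly under data_dir, in sorted name order.

    Hypothetical helper: the sort order is an assumption, not a requirement
    stated in the review.
    """
    candidates = sorted(data_dir.glob("*.md"))
    if not candidates:
        raise FileNotFoundError(
            f"No markdown files found in {data_dir}; run the extraction step first."
        )
    return candidates[0]
```

In `run_index_mode`, the hard-coded `markdown_file="data/book_with_pages.md"` argument could then become `markdown_file=str(first_markdown_file())`, if the author chooses this approach.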

shahmeer99 · 2026-02-01T23:08:57Z

Please change this to accept multiple PDFs. I know this may break other things in the pipeline, but I'll fix those afterwards. Basically, run extraction for each PDF you find and store the results in the "data" folder with this naming convention: "<input_file_name_without_the_.pdf>--extracted_markdown.md".

So if you have two files, "chapter1.pdf" and "blah2.pdf", you will end up with two MD files in ./data named "chapter1--extracted_markdown.md" and "blah2--extracted_markdown.md".
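The naming convention above could be sketched roughly as follows. The helper names `output_path_for` and `extract_all` are hypothetical; `convert` stands in for the PR's existing `convert_and_save_with_page_numbers(input_pdf, output_md)` and is passed in as a parameter so the sketch stays self-contained.

```python
from pathlib import Path
from typing import Callable


def output_path_for(pdf_path: Path, data_dir: Path = Path("data")) -> Path:
    """Map e.g. data/chapters/chapter1.pdf to data/chapter1--extracted_markdown.md."""
    return data_dir / f"{pdf_path.stem}--extracted_markdown.md"


def extract_all(
    convert: Callable[[str, str], None],
    chapters_dir: Path = Path("data/chapters"),
    data_dir: Path = Path("data"),
) -> list[Path]:
    """Run convert(input_pdf, output_md) once per PDF found in chapters_dir.

    Returns the list of markdown paths that were requested, in sorted
    order of the input PDF names.
    """
    pdfs = sorted(chapters_dir.glob("*.pdf"))
    if not pdfs:
        raise FileNotFoundError(f"No PDFs found in {chapters_dir}.")
    outputs = []
    for pdf in pdfs:
        output_md = output_path_for(pdf, data_dir)
        print(f"Converting '{pdf}' to '{output_md}'...")
        convert(str(pdf), str(output_md))
        outputs.append(output_md)
    return outputs
```

Under these assumptions, the `__main__` block in the extraction module would reduce to a single call such as `extract_all(convert_and_save_with_page_numbers)`, replacing the "exactly one PDF" check.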

@@ -58,6 +58,11 @@ clean:
     	find . -type d -name __pycache__ -exec rm -rf {} +
     	find . -type f -name "*.pyc" -delete
+    # PDF to Markdown extraction
+    run-extract:
+    	@echo "Extracting PDF to markdown (data/chapters/*.pdf -> data/book_with_pages.md)"
+    	conda run --no-capture-output -n tokensmith python -m src.preprocessing.extraction
     # Run modes
     run-index:
     	@echo "Running TokenSmith index mode with additional CLI args: $(ARGS)"

@@ -84,7 +84,14 @@ mkdir -p data/chapters
     cp your-documents.pdf data/chapters/
     ```
-    ### 5) Index documents
+    ### 5) Extract PDF to markdown
+    ```shell
+    make run-extract
+    ```
+    This generates a `book_with_pages.md` under `TOKENSMITH/data/`
+    ### 6) Index documents
     ```shell
     make run-index
@@ -96,15 +103,15 @@ With custom parameters:
     make run-index ARGS="--pdf_range 1-10 --chunk_mode chars --visualize"
     ```
-    ### 6) Chat
+    ### 7) Chat
     ```shell
     python -m src.main chat
     ```
     > If you see a missing-model error, download `qwen2.5-0.5b-instruct-q5_k_m.gguf` into `llama.cpp/models`.
-    ### 7) Deactivate
+    ### 8) Deactivate
     ```shell
     conda deactivate

@@ -87,7 +87,7 @@ def run_index_mode(args: argparse.Namespace, cfg: RAGConfig):
         artifacts_dir = cfg.get_artifacts_directory()
         build_index(
-            markdown_file="data/silberschatz.md",
+            markdown_file="data/book_with_pages.md",
             chunker=chunker,
             chunk_config=cfg.chunk_config,
             embedding_model_path=cfg.embed_model,

@@ -275,8 +275,22 @@ def preprocess_extracted_section(text: str) -> str:
     if __name__ == '__main__':
-        input_pdf = "data/chapters/silberschatz.pdf"
-        output_md = 'data/silberschatz.md'
+        # Returns all pdf files under data/chapters/
+        chapters_dir = Path("data/chapters")
+        pdfs = sorted(chapters_dir.glob("*.pdf"))
+        # Ensure exactly one PDF is found
+        if len(pdfs) == 0:
+            print("ERROR: No PDFs found in data/chapters/. Please copy a PDF there first.", file=sys.stderr)
+            sys.exit(1)
+        if len(pdfs) > 1:
+            print("ERROR: Multiple PDFs found in data/chapters/. Keep only one for now:", file=sys.stderr)
+            for p in pdfs:
+                print(f"  - {p}", file=sys.stderr)
+            sys.exit(1)
+        input_pdf = str(pdfs[0])
+        output_md = "data/book_with_pages.md"
         print(f"Converting '{input_pdf}' to '{output_md}'...")
         convert_and_save_with_page_numbers(input_pdf, output_md)
