Add PDF to Markdown extraction workflow #67
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -84,7 +84,14 @@ mkdir -p data/chapters | |
| cp your-documents.pdf data/chapters/ | ||
| ``` | ||
|
|
||
| ### 5) Index documents | ||
| ### 5) Extract PDF to markdown | ||
|
|
||
| ```shell | ||
| make run-extract | ||
| ``` | ||
| This generates a `book_with_pages.md` under `TOKENSMITH/data/` | ||
|
Contributor
Adjust this according to the previous comments.
||
|
|
||
| ### 6) Index documents | ||
|
|
||
| ```shell | ||
| make run-index | ||
|
|
@@ -96,15 +103,15 @@ With custom parameters: | |
| make run-index ARGS="--pdf_range 1-10 --chunk_mode chars --visualize" | ||
| ``` | ||
|
|
||
| ### 6) Chat | ||
| ### 7) Chat | ||
|
|
||
| ```shell | ||
| python -m src.main chat | ||
| ``` | ||
|
|
||
| > If you see a missing-model error, download `qwen2.5-0.5b-instruct-q5_k_m.gguf` into `llama.cpp/models`. | ||
|
|
||
| ### 7) Deactivate | ||
| ### 8) Deactivate | ||
|
|
||
| ```shell | ||
| conda deactivate | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -87,7 +87,7 @@ def run_index_mode(args: argparse.Namespace, cfg: RAGConfig): | |
| artifacts_dir = cfg.get_artifacts_directory() | ||
|
|
||
| build_index( | ||
| markdown_file="data/silberschatz.md", | ||
| markdown_file="data/book_with_pages.md", | ||
|
Contributor
Similarly here. Look over all the MD files in ./data and just build the index over the first one you find. I will fix this behavior later myself; this is just so the code doesn't break after your PR is merged.
||
| chunker=chunker, | ||
| chunk_config=cfg.chunk_config, | ||
| embedding_model_path=cfg.embed_model, | ||
|
|
||
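The review comment on `markdown_file` above asks for the index to be built over the first MD file found in ./data. A minimal sketch of that lookup follows, assuming "first" means alphabetically first and that only files directly under the directory count. The helper name `first_markdown_file` and its `data_dir` parameter are hypothetical and not part of this PR.

```python
from pathlib import Path


def first_markdown_file(data_dir: str = "data") -> Path:
    """Return the alphabetically first .md file directly under data_dir.

    Hypothetical helper: the name and the sort order are assumptions,
    not something this PR defines.
    """
    # sorted() makes the choice deterministic; glob order alone is not guaranteed.
    candidates = sorted(Path(data_dir).glob("*.md"))
    if not candidates:
        raise FileNotFoundError(
            f"No markdown files found in {data_dir}/. Run the extraction step first."
        )
    return candidates[0]
```

With a helper like this, the call site could pass `markdown_file=str(first_markdown_file())` instead of a hard-coded path.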
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -275,8 +275,22 @@ def preprocess_extracted_section(text: str) -> str: | |
|
|
||
|
|
||
| if __name__ == '__main__': | ||
| input_pdf = "data/chapters/silberschatz.pdf" | ||
| output_md = 'data/silberschatz.md' | ||
| # Returns all pdf files under data/chapters/ | ||
| chapters_dir = Path("data/chapters") | ||
| pdfs = sorted(chapters_dir.glob("*.pdf")) | ||
|
|
||
| # Ensure exactly one PDF is found | ||
| if len(pdfs) == 0: | ||
| print("ERROR: No PDFs found in data/chapters/. Please copy a PDF there first.", file=sys.stderr) | ||
|
Contributor
Please change this to accept multiple PDFs. I know this may break other things in the pipeline, but I'll fix those afterwards. Basically, you will run extraction for each PDF you find and store the results in the "data" folder with this naming convention: "<input_file_name_without_the_.pdf>--extracted_markdown.md". So if you have 2 files, "chapter1.pdf" and "blah2.pdf", you will have 2 MD files in ./data named "chapter1--extracted_markdown.md" and "blah2--extracted_markdown.md".
||
| sys.exit(1) | ||
| if len(pdfs) > 1: | ||
| print("ERROR: Multiple PDFs found in data/chapters/. Keep only one for now:", file=sys.stderr) | ||
| for p in pdfs: | ||
| print(f" - {p}", file=sys.stderr) | ||
| sys.exit(1) | ||
|
|
||
| input_pdf = str(pdfs[0]) | ||
| output_md = "data/book_with_pages.md" | ||
|
|
||
| print(f"Converting '{input_pdf}' to '{output_md}'...") | ||
| convert_and_save_with_page_numbers(input_pdf, output_md) | ||
|
|
||
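The review comment on the single-PDF check above asks for one markdown output per input PDF, named after the input file. A minimal sketch of that mapping follows. The helper names `output_path_for` and `plan_extractions` are hypothetical; only the directory layout and the "--extracted_markdown.md" suffix come from the review comment.

```python
from pathlib import Path
from typing import List, Tuple


def output_path_for(pdf: Path, data_dir: Path = Path("data")) -> Path:
    """Map e.g. chapter1.pdf to data/chapter1--extracted_markdown.md."""
    # Path.stem drops the final suffix, so "chapter1.pdf" becomes "chapter1".
    return data_dir / f"{pdf.stem}--extracted_markdown.md"


def plan_extractions(
    chapters_dir: Path = Path("data/chapters"),
    data_dir: Path = Path("data"),
) -> List[Tuple[Path, Path]]:
    """Return (input_pdf, output_md) pairs for every PDF in chapters_dir.

    Hypothetical helper: an empty list means no PDFs were found, and the
    caller decides how to report that.
    """
    pdfs = sorted(chapters_dir.glob("*.pdf"))
    return [(pdf, output_path_for(pdf, data_dir)) for pdf in pdfs]
```

The `__main__` block could then loop over the pairs and call `convert_and_save_with_page_numbers(str(pdf), str(md))` for each one, exiting with an error only when the list is empty.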
Adjust the "-> data/book_with_pages.md" part of this echo statement so that it refers to the MD files in ./data.