5 changes: 5 additions & 0 deletions Makefile
@@ -58,6 +58,11 @@ clean:
find . -type d -name __pycache__ -exec rm -rf {} +
find . -type f -name "*.pyc" -delete

# PDF to Markdown extraction
run-extract:
@echo "Extracting PDF to markdown (data/chapters/*.pdf -> data/book_with_pages.md)"
Contributor:
Adjust the "-> data/book_with_pages.md" part of this echo statement to just say MD files in ./data.

conda run --no-capture-output -n tokensmith python -m src.preprocessing.extraction

# Run modes
run-index:
@echo "Running TokenSmith index mode with additional CLI args: $(ARGS)"
13 changes: 10 additions & 3 deletions README.md
@@ -84,7 +84,14 @@ mkdir -p data/chapters
cp your-documents.pdf data/chapters/
```

### 5) Index documents
### 5) Extract PDF to markdown

```shell
make run-extract
```
This generates a `book_with_pages.md` file under `TOKENSMITH/data/`.
Contributor:
Adjust according to the previous comments.


### 6) Index documents

```shell
make run-index
@@ -96,15 +103,15 @@ With custom parameters:
make run-index ARGS="--pdf_range 1-10 --chunk_mode chars --visualize"
```

### 6) Chat
### 7) Chat

```shell
python -m src.main chat
```

> If you see a missing-model error, download `qwen2.5-0.5b-instruct-q5_k_m.gguf` into `llama.cpp/models`.

### 7) Deactivate
### 8) Deactivate

```shell
conda deactivate
2 changes: 1 addition & 1 deletion src/main.py
@@ -87,7 +87,7 @@ def run_index_mode(args: argparse.Namespace, cfg: RAGConfig):
artifacts_dir = cfg.get_artifacts_directory()

build_index(
markdown_file="data/silberschatz.md",
markdown_file="data/book_with_pages.md",
Contributor:
Similarly here. Look over all the .md files in ./data and just build the index over the first .md file you find. I will fix this behavior later myself; this is just so the code doesn't break after your PR is merged. (A minimal sketch of what I mean follows this hunk.)

chunker=chunker,
chunk_config=cfg.chunk_config,
embedding_model_path=cfg.embed_model,
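
A minimal sketch of that interim behavior, assuming `build_index` keeps its current signature; the glob logic is illustrative, not the final fix:

```python
from pathlib import Path

# Illustrative only: index the first markdown file found under data/.
md_files = sorted(Path("data").glob("*.md"))
if not md_files:
    raise FileNotFoundError("No .md files found in data/; run `make run-extract` first.")

build_index(
    markdown_file=str(md_files[0]),  # first .md found, per the comment above
    chunker=chunker,
    chunk_config=cfg.chunk_config,
    embedding_model_path=cfg.embed_model,
    # ...remaining arguments unchanged
)
```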
18 changes: 16 additions & 2 deletions src/preprocessing/extraction.py
@@ -275,8 +275,22 @@ def preprocess_extracted_section(text: str) -> str:


if __name__ == '__main__':
input_pdf = "data/chapters/silberschatz.pdf"
output_md = 'data/silberschatz.md'
# Collect all PDF files under data/chapters/
chapters_dir = Path("data/chapters")
pdfs = sorted(chapters_dir.glob("*.pdf"))

# Ensure exactly one PDF is found
if len(pdfs) == 0:
print("ERROR: No PDFs found in data/chapters/. Please copy a PDF there first.", file=sys.stderr)
Contributor:
Please change this to accept multiple PDFs. I know this may break other things in the pipeline, but I'll fix those afterwards. Basically, you will do extraction for each PDF you find and store the results in the "data" folder with this naming convention: "<input_file_name_without_the_.pdf>--extracted_markdown.md".

So if you have 2 files, "chapter1.pdf" and "blah2.pdf", you will have 2 .md files in ./data named "chapter1--extracted_markdown.md" and "blah2--extracted_markdown.md". (A minimal sketch follows the hunk below.)

sys.exit(1)
if len(pdfs) > 1:
print("ERROR: Multiple PDFs found in data/chapters/. Keep only one for now:", file=sys.stderr)
for p in pdfs:
print(f" - {p}", file=sys.stderr)
sys.exit(1)

input_pdf = str(pdfs[0])
output_md = "data/book_with_pages.md"

print(f"Converting '{input_pdf}' to '{output_md}'...")
convert_and_save_with_page_numbers(input_pdf, output_md)
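
A minimal sketch of the requested loop, assuming `convert_and_save_with_page_numbers(input_pdf, output_md)` keeps its current signature and the output names follow the convention in the comment above:

```python
import sys
from pathlib import Path

# Illustrative sketch: extract every PDF under data/chapters/ and write
# one markdown file per PDF into data/ named "<stem>--extracted_markdown.md".
chapters_dir = Path("data/chapters")
pdfs = sorted(chapters_dir.glob("*.pdf"))
if not pdfs:
    print("ERROR: No PDFs found in data/chapters/. Please copy a PDF there first.", file=sys.stderr)
    sys.exit(1)

for pdf in pdfs:
    # e.g. data/chapters/chapter1.pdf -> data/chapter1--extracted_markdown.md
    output_md = Path("data") / f"{pdf.stem}--extracted_markdown.md"
    print(f"Converting '{pdf}' to '{output_md}'...")
    convert_and_save_with_page_numbers(str(pdf), str(output_md))
```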