A Python tool that processes PDFs, extracts specified pages, performs OCR, converts text into structured tables, and saves them as CSV files. Supports batch processing of all PDFs in a folder with parallel execution.
- Extracts single pages or page ranges from PDFs
- OCR via Tesseract
- Converts detected tables into CSV format
- Handles multi-line table cells
- Processes all PDFs in input_pdfs/ in parallel
- Saves raw text and structured CSV outputs
- Graceful error handling and logging
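To illustrate the table conversion and multi-line cell handling listed above, here is a minimal sketch. It is not the tool's own code: the function names `lines_to_rows` and `rows_to_csv` are hypothetical, and the heuristic assumes that columns are separated by two or more spaces and that an indented line continues the last cell of the previous row.

```python
import csv
import io
import re


def lines_to_rows(lines):
    """Turn OCR text lines into table rows (illustrative heuristic only).

    Assumptions: columns are separated by two or more spaces, and an
    indented line continues the last cell of the previous row.
    """
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace() and rows:
            # Indented line: treat it as the rest of a multi-line cell.
            rows[-1][-1] += " " + line.strip()
        else:
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; csv.writer quotes cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


sample = [
    "Item  Qty  Description",
    "Bolt  10  Steel, zinc",
    "    plated",
    "Nut  25  Brass",
]
print(rows_to_csv(lines_to_rows(sample)))
```

Real OCR output is usually noisier than this sample, so a production version would likely need column alignment based on character positions rather than whitespace runs alone.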
- Install dependencies: pip install -r requirements.txt
- Install Tesseract:
  - Linux: sudo apt install tesseract-ocr
  - Mac: brew install tesseract
- Install Poppler:
  - Linux: sudo apt install poppler-utils
  - Mac: brew install poppler
- Place PDFs in input_pdfs/.
- Run: python main.py
- Outputs are saved in output_csv/.
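For readers curious how a run over the whole folder might be organized, the following is a rough sketch of parallel batch processing with per-file error handling. It does not reproduce main.py: `process_folder` and its `worker` argument are hypothetical, and the use of a thread pool is an assumption (threads can be sufficient when the heavy work happens in external processes, but a process pool may suit CPU-bound Python code better).

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def process_folder(input_dir, output_dir, worker, max_workers=4):
    """Run worker(pdf_path, csv_path) for every PDF in input_dir in parallel.

    `worker` is a caller-supplied function (hypothetical here) that does the
    page extraction, OCR, and CSV writing for a single file.
    Returns a dict mapping each PDF file name to "ok" or "failed: <reason>".
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(worker, pdf, output_dir / (pdf.stem + ".csv")): pdf
            for pdf in sorted(input_dir.glob("*.pdf"))
        }
        for future in as_completed(futures):
            pdf = futures[future]
            try:
                future.result()
                results[pdf.name] = "ok"
            except Exception as exc:
                # One bad file is logged and recorded; the batch keeps going.
                logging.error("Failed to process %s: %s", pdf.name, exc)
                results[pdf.name] = f"failed: {exc}"
    return results
```

A call such as `process_folder("input_pdfs", "output_csv", my_worker)` would then mirror the steps above, with `my_worker` standing in for whatever per-file logic the tool applies.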