A tool that processes PDFs, extracts specified pages, performs OCR, and converts text into structured tables. Supports batch processing with parallel execution.

ayahaustine/ocr

PDF OCR Table Extractor

A Python tool that processes PDFs, extracts specified pages, performs OCR, converts text into structured tables, and saves them as CSV files. Supports batch processing of all PDFs in a folder with parallel execution.
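The batch behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the names process_pdf and run_batch are hypothetical, the per-file step is a placeholder that writes a dummy CSV, and a thread pool is assumed to be sufficient because OCR is typically run as an external Tesseract process.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Optional

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("batch")


def process_pdf(pdf_path: Path, out_dir: Path) -> Path:
    """Hypothetical per-file step; a real version would extract pages and run OCR."""
    out_path = out_dir / (pdf_path.stem + ".csv")
    out_path.write_text("placeholder\n", encoding="utf-8")
    return out_path


def run_batch(
    in_dir: Path,
    out_dir: Path,
    worker: Callable[[Path, Path], Path] = process_pdf,
    max_workers: int = 4,
) -> Dict[str, Optional[Path]]:
    """Process every *.pdf in in_dir concurrently.

    A failure in one file is logged and recorded as None; the other
    files still run to completion.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    results: Dict[str, Optional[Path]] = {}
    pdfs = sorted(in_dir.glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, pdf, out_dir): pdf for pdf in pdfs}
        for fut in as_completed(futures):
            pdf = futures[fut]
            try:
                results[pdf.name] = fut.result()
                log.info("done: %s", pdf.name)
            except Exception:
                # Log the traceback and keep going with the remaining files.
                log.exception("failed: %s", pdf.name)
                results[pdf.name] = None
    return results
```

With the folder names used in this README, a call would look like run_batch(Path("input_pdfs"), Path("output_csv")). If the per-file work were CPU-bound inside Python itself, ProcessPoolExecutor from the same module offers the same submit interface.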

Features

  • Extracts single pages or page ranges from PDFs
  • Runs OCR via Tesseract
  • Converts detected tables into CSV format
  • Handles multi-line table cells
  • Processes all PDFs in input_pdfs/ in parallel
  • Saves raw text and structured CSV outputs
  • Graceful error handling and logging
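To make the table conversion and multi-line cell handling concrete, here is a small sketch using only the standard library. It is an assumption about how such a step could work, not the tool's actual parser: parse_table and to_csv are hypothetical names, columns are assumed to be separated by two or more spaces, and an indented line with fewer cells than the row above is treated as the wrapped tail of that row's last cell.

```python
import csv
import io
import re
from typing import List

# Assumption: columns in the OCR text are separated by runs of 2+ whitespace chars.
COLUMN_GAP = re.compile(r"\s{2,}")


def parse_table(text: str) -> List[List[str]]:
    """Split OCR text into rows, folding continuation lines into the previous row."""
    rows: List[List[str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        indented = line[0].isspace()
        cells = COLUMN_GAP.split(line.strip())
        # Heuristic: an indented line with fewer cells than the row above
        # is the wrapped remainder of that row's last cell.
        if rows and indented and len(cells) < len(rows[-1]):
            rows[-1][-1] += " " + " ".join(cells)
        else:
            rows.append(cells)
    return rows


def to_csv(rows: List[List[str]]) -> str:
    """Serialise rows as CSV text; the csv module quotes cells containing commas."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    sample = (
        "Item        Qty   Description\n"
        "Widget      2     Small blue\n"
        "                  widget, boxed\n"
        "Gadget      10    Large\n"
    )
    print(to_csv(parse_table(sample)), end="")
    # Item,Qty,Description
    # Widget,2,"Small blue widget, boxed"
    # Gadget,10,Large
```

The heuristic only merges wrapped text into the last column; real OCR output is often messier, so column positions or bounding boxes may be needed for cells that wrap in the middle of a row.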

Setup

  1. Install dependencies: pip install -r requirements.txt

  2. Install Tesseract:

    • Linux (Debian/Ubuntu): sudo apt install tesseract-ocr
    • macOS: brew install tesseract

  3. Install Poppler:

    • Linux (Debian/Ubuntu): sudo apt install poppler-utils
    • macOS: brew install poppler

  4. Place PDFs in input_pdfs/.

  5. Run: python main.py

  6. Outputs are saved in output_csv/.
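Before running the tool, it can help to confirm that the system packages from steps 2 and 3 are actually on PATH. The check below is an optional sketch and not part of the repository; missing_tools is a hypothetical helper, and it assumes that looking for the tesseract binary and Poppler's pdftoppm utility is a reasonable proxy for a working install.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("tesseract", "pdftoppm")) -> List[str]:
    """Return the names of executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Tesseract and Poppler executables found.")
```

If either name is reported as missing, repeat the matching install step above and open a new shell so PATH changes take effect.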
