PDF Toolkit

A toolkit for reading data from nested, complex tabular layouts in PDF files and images. The input file should be processed so that the output is a dictionary structure that mirrors how the information is divided into sections, columns, and other groupings.
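To make the target output more concrete, here is a purely illustrative sketch of what a nested dictionary for a table with sections and columns could look like. All keys, values, and the `lookup` helper are invented for this example; they are not the toolkit's actual schema or API.

```python
# Hypothetical example only: the real keys depend on the input document
# and on the algorithm used. Nothing here is the toolkit's actual schema.
extracted = {
    "Revenue": {  # section, e.g. a merged header cell
        "Q1": {"Product": "100", "Services": "40"},  # column -> row label -> cell text
        "Q2": {"Product": "120", "Services": "45"},
    },
    "Costs": {
        "Q1": {"Product": "60", "Services": "20"},
        "Q2": {"Product": "70", "Services": "25"},
    },
}


def lookup(data, *path):
    """Walk the nested dictionary along the given keys and return what is found."""
    node = data
    for key in path:
        node = node[key]
    return node


print(lookup(extracted, "Revenue", "Q2", "Services"))
```

Nesting by section and then by column keeps each cell reachable through a short key path, which is the kind of structure the description above asks for.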

Structure

pdf-toolkit/
├── web/ # FastAPI application
├── src/toolkit/ # Core library for PDF data extraction
│ ├── algorithms/ # Different extraction algorithms
│ │ ├── complex_tab.py # 
│ │ ├── financial_borderless.py # 
│ │ ├── row_oriented.py # 
│ │ └── table_representation.py # Other extraction methods
│ └── utils/ # Utility functions
│   └── cell.py # PDF-specific utilities
├── tests/ #  
├── data/ # test PDF files
├── docs/ # Documentation
├── requirements.txt # Python dependencies
├── pyproject.toml # Project metadata and build system
├── README.md # Project overview and instructions
└── .gitignore # Ignored files and directories

Add toolkit path to PYTHONPATH in your env

# windows
$env:PYTHONPATH="$env:PYTHONPATH;C:\Users\zprp\pdf-toolkit\src" # replace with your path

# unix
export PYTHONPATH=$PYTHONPATH:/path/to/src
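If changing the environment variable is inconvenient (for example in a notebook), the same effect can be approximated at runtime by extending `sys.path`. This is a minimal sketch, not part of the toolkit; `add_src_to_path` is a hypothetical helper and `/path/to/src` is a placeholder you need to replace.

```python
import sys
from pathlib import Path


def add_src_to_path(src_dir):
    """Prepend src_dir to sys.path if it is missing; return the resolved path."""
    resolved = str(Path(src_dir).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Placeholder path: replace with the src/ folder of your checkout.
added = add_src_to_path("/path/to/src")
print(added in sys.path)
```

Unlike PYTHONPATH, this only affects the current Python process, so it has to run before the first import of the toolkit.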

External dependencies - installation:

Poppler

You also need to install Poppler, because this library uses pdf2image, which is a wrapper around Poppler.

# windows
# download zip from https://github.com/oschwartz10612/poppler-windows/releases
# unpack and add /bin folder to path

# ubuntu
sudo apt install poppler-utils

# mac
brew install poppler
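After installing, it can help to confirm that the Poppler executables are visible on PATH. The sketch below checks for `pdftoppm` and `pdfinfo`, two Poppler command-line tools that pdf2image calls; `find_poppler_tools` is a hypothetical helper written for this README, not part of the toolkit.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Map each Poppler executable name to its location on PATH, or None if missing."""
    return {name: shutil.which(name) for name in tools}


status = find_poppler_tools()
for name, location in status.items():
    print(f"{name}: {location or 'NOT FOUND on PATH'}")
```

If a tool is reported as missing on Windows, the unpacked /bin folder is most likely not on PATH yet.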

OCR

Install Tesseract on your operating system

# ubuntu
sudo apt install tesseract-ocr
sudo apt install tesseract-ocr-pol # for the Polish language

# mac
brew install tesseract
brew install tesseract-lang # additional language packs, including Polish
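To check that Tesseract and the Polish language data are actually available, you can list the installed languages with `tesseract --list-langs`. The following is a small sketch around that command; `tesseract_languages` is a hypothetical helper, and the header filtering assumes the usual "List of available languages" first line.

```python
import shutil
import subprocess


def tesseract_languages():
    """Return installed Tesseract language codes, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some versions print the list to stderr, so both streams are combined.
    lines = (result.stdout + result.stderr).splitlines()
    return [
        line.strip()
        for line in lines
        if line.strip() and not line.lower().startswith("list of")
    ]


langs = tesseract_languages()
if langs is None:
    print("tesseract not found on PATH")
else:
    print("Polish available:", "pol" in langs)
```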

Torch with CUDA

To run on CUDA, install a CUDA-enabled build of torch in your environment. The command below uses the CUDA 11.8 (cu118) wheel index.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
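A quick way to verify the installation is to ask torch whether it can see a CUDA device. This is a minimal sketch using the standard `torch.cuda` calls; `cuda_status` is a hypothetical helper, and it only reports the state without changing anything.

```python
def cuda_status():
    """Describe whether torch is installed and whether it can use CUDA."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "torch is installed, but CUDA is not available (CPU only)"


print(cuda_status())
```

If the CPU-only message appears on a machine with an NVIDIA GPU, the installed wheel or the driver version is a likely place to look first.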

Install, Test, Run

Install the PDM package to simplify building and installing the library

pip install pdm

Makefile Commands - run lib

make install # install dependencies
make test # run tests
make build # build lib

Use CLI

# For complex, nested, aligned data with a grid
python src/toolkit data/complex/complex-example.pdf --algorithm=alligned-bordered --ocr=tesseract

# For misaligned data without a grid, as in financial reports
python src/toolkit data/financial/meta-financial-page5.pdf --bordered=False --algorithm=misalligned-borderless

# For row-oriented data with a table grid
python src/toolkit data/medical/hospitals_2017.pdf --first-page=1 --last-page=2 --algorithm=misalligned
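For batch runs it may be convenient to assemble these commands from Python. The sketch below only builds the argument list; the flag names and the algorithm value are copied verbatim from the examples above, while `build_command` is a hypothetical helper that is not part of the toolkit.

```python
import sys


def build_command(pdf_path, algorithm, **options):
    """Build the CLI argument list; keyword names become --kebab-case flags."""
    cmd = [sys.executable, "src/toolkit", pdf_path, f"--algorithm={algorithm}"]
    for name, value in options.items():
        cmd.append(f"--{name.replace('_', '-')}={value}")
    return cmd


cmd = build_command(
    "data/medical/hospitals_2017.pdf", "misalligned", first_page=1, last_page=2
)
print(" ".join(cmd))
# To actually run it from the repository root:
#     import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.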

PDM commands

pdm init # creates env
pdm add <package> # adds package
pdm add --dev pytest # adds dev package
pdm run pytest tests/ # runs tests
pdm build # builds the package
pip install dist/*.whl # installs the built wheel

Run interpreter tests

pip install nox==2025.5.1
nox
