PDF Toolkit

A toolkit for reading data from nested and complex tabular layouts in PDF files and images. The input file should be processed so that the output is a dictionary structure reflecting how the information is divided into sections, columns, and other groupings.
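To make the idea of a nested dictionary output concrete, here is a minimal sketch. The keys and values in `sample_output` are invented purely for illustration and are not actual toolkit output; `iter_leaves` is a hypothetical helper, not part of the library.

```python
from typing import Any, Dict, Iterator, Tuple

# Hypothetical output: every key and value here is invented for illustration.
sample_output: Dict[str, Any] = {
    "Section A": {
        "Column 1": ["value 1", "value 2"],
        "Column 2": ["value 3", "value 4"],
    },
    "Section B": {
        "Subsection B1": {"Column 1": ["value 5"]},
    },
}


def iter_leaves(
    node: Any, path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Yield (path, leaf) pairs, where path lists the keys leading to the leaf."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_leaves(child, path + (key,))
    else:
        yield path, node


for path, values in iter_leaves(sample_output):
    print(" > ".join(path), "=", values)
```

Walking the structure this way shows how sections and columns map onto dictionary keys, whatever the real nesting depth turns out to be.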

Structure

pdf-toolkit/
├── web/ # FastAPI application
├── src/toolkit/ # Core library for PDF data extraction
│ ├── algorithms/ # Different extraction algorithms
│ │ ├── complex_tab.py # Complex nested tables with a grid
│ │ ├── financial_borderless.py # Borderless tables, e.g. financial reports
│ │ ├── row_oriented.py # Row-oriented tables
│ │ └── table_representation.py # Other extraction methods
│ └── utils/ # Utility functions
│   └── cell.py # PDF-specific utilities
├── tests/ # Tests
├── data/ # Test PDF files
├── docs/ # Documentation
├── requirements.txt # Python dependencies
├── pyproject.toml # Project metadata and build system
├── README.md # Project overview and instructions
└── .gitignore # Ignored files and directories

Add the toolkit path to PYTHONPATH in your environment

# windows (PowerShell)
$env:PYTHONPATH="$env:PYTHONPATH;C:\Users\zprp\pdf-toolkit\src" # replace with your path

# unix
export PYTHONPATH=$PYTHONPATH:/path/to/src
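A quick way to confirm the path was picked up is to ask Python whether it can locate the package. This is a minimal sketch; it assumes the package is importable as `toolkit` (inferred from the `src/toolkit/` directory), and `is_importable` is a hypothetical helper.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "toolkit" is an assumed package name based on the src/toolkit/ folder.
    print("toolkit importable:", is_importable("toolkit"))
```

If this prints `False`, the `src` directory is most likely missing from PYTHONPATH in the shell that launched Python.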

External dependencies - installation:

Poppler

You also need to install Poppler, because this library uses pdf2image, which is a wrapper around Poppler.

# windows
# download zip from https://github.com/oschwartz10612/poppler-windows/releases
# unpack it and add the bin folder to PATH

# ubuntu
sudo apt install poppler-utils

# mac
brew install poppler
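After installing, it can help to check that the Poppler executables are actually visible on PATH. The sketch below looks for `pdftoppm` and `pdfinfo`, the command-line tools pdf2image calls; `missing_executables` is a hypothetical helper name.

```python
import shutil
from typing import Iterable, List

# Poppler command-line tools that pdf2image shells out to.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_executables(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_executables(POPPLER_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Poppler tools found on PATH")
```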

OCR

Install tesseract on your OS

# ubuntu
sudo apt install tesseract-ocr
sudo apt install tesseract-ocr-pol # for the Polish language

# mac
brew install tesseract
brew install tesseract-lang # for the Polish language
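To verify that Tesseract and the Polish language data are installed, one option is to parse the output of `tesseract --list-langs`. This is a rough sketch: `parse_langs` and `installed_langs` are hypothetical helpers, and the parser simply assumes that language codes are the output lines containing no spaces.

```python
import shutil
import subprocess
from typing import Set


def parse_langs(output: str) -> Set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Assumes codes are the non-empty lines without spaces, which skips the
    "List of available languages ..." header line.
    """
    lines = (line.strip() for line in output.splitlines())
    return {line for line in lines if line and " " not in line}


def installed_langs() -> Set[str]:
    """Return installed language codes, or an empty set if tesseract is missing."""
    if shutil.which("tesseract") is None:
        return set()
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Read both streams, since the list may not always arrive on stdout.
    return parse_langs(result.stdout + "\n" + result.stderr)


print("Polish data installed:", "pol" in installed_langs())
```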

Torch with CUDA

To run on CUDA, you need to install a CUDA-enabled build of torch in your environment

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
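Once torch is installed, a short check can confirm whether the CUDA build is actually in use. This is a minimal sketch; `cuda_status` is a hypothetical helper, and it degrades gracefully when torch is absent.

```python
def cuda_status() -> str:
    """Describe whether torch is installed and whether it can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "torch is installed, but CUDA is not available (CPU only)"


print(cuda_status())
```

If the CPU-only message appears despite having a GPU, the installed wheel may not match the local CUDA driver version.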

Install, Test, Run

Install PDM to simplify building and installing the library

pip install pdm

Makefile commands

make install # install dependencies
make test # run tests
make build # build lib

Use CLI

# For complex nested data with a grid and aligned cells
python src/toolkit data/complex/complex-example.pdf --algorithm=alligned-bordered --ocr=tesseract

# For misaligned data without a grid, as in financial reports
python src/toolkit data/financial/meta-financial-page5.pdf --bordered=False --algorithm=misalligned-borderless

# For row-oriented data with a table grid
python src/toolkit data/medical/hospitals_2017.pdf --first-page=1 --last-page=2 --algorithm=misalligned
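When processing several files, the same commands can be assembled from Python. The sketch below only builds the argument list from the flags shown above; `build_command` is a hypothetical helper, not part of the toolkit, and it assumes the command is run from the repository root.

```python
import sys
from typing import List, Optional


def build_command(
    pdf_path: str,
    algorithm: str,
    ocr: Optional[str] = None,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
) -> List[str]:
    """Build the CLI argument list using only the flags shown in the examples."""
    cmd = [sys.executable, "src/toolkit", pdf_path, f"--algorithm={algorithm}"]
    if ocr is not None:
        cmd.append(f"--ocr={ocr}")
    if first_page is not None:
        cmd.append(f"--first-page={first_page}")
    if last_page is not None:
        cmd.append(f"--last-page={last_page}")
    return cmd


cmd = build_command(
    "data/medical/hospitals_2017.pdf", "misalligned", first_page=1, last_page=2
)
print(" ".join(cmd))
# To execute it from the repository root: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.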

PDM commands

pdm init # creates env
pdm add <package> # adds package
pdm add --dev pytest # adds dev package
pdm run pytest tests/ # runs tests
pdm build # builds the package into dist/
pip install dist/*.whl # installs the built package

Run tests across Python interpreters with nox

pip install nox==2025.5.1
nox