A toolkit for reading data from nested and complex tabular layouts in PDF files and images. The input file is processed into a dictionary structure that reflects how the information is divided into sections, columns, and other groupings.
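As a rough illustration of what such a dictionary structure could look like, here is a minimal sketch. The section names, column names, and nesting depth are hypothetical and are not the toolkit's actual output schema; `iter_leaf_paths` is an illustrative helper, not part of the library.

```python
# Hypothetical example: keys and nesting are illustrative only,
# not the toolkit's actual output schema.
extracted = {
    "Section A": {
        "Column 1": ["10", "20"],
        "Column 2": ["30", "40"],
    },
    "Section B": {
        "Column 1": ["50"],
    },
}


def iter_leaf_paths(node, prefix=()):
    """Yield (path, value) pairs for every non-dict value in a nested dict."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_leaf_paths(child, prefix + (key,))
    else:
        yield prefix, node


# Walk the structure and print each column together with its section.
for path, value in iter_leaf_paths(extracted):
    print(" / ".join(path), "->", value)
```

Walking the nested result this way keeps the section and column hierarchy visible while still giving flat access to the cell values.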
pdf-toolkit/
├── web/                              # FastAPI application
├── src/toolkit/                      # Core library for PDF data extraction
│   ├── algorithms/                   # Different extraction algorithms
│   │   ├── complex_tab.py
│   │   ├── financial_borderless.py
│   │   ├── row_oriented.py
│   │   └── table_representation.py   # Other extraction methods
│   └── utils/                        # Utility functions
│       └── cell.py                   # PDF-specific utilities
├── tests/
├── data/                             # test PDF files
├── docs/                             # Documentation
├── requirements.txt                  # Python dependencies
├── pyproject.toml                    # Project metadata and build system
├── README.md                         # Project overview and instructions
└── .gitignore                        # Ignored files and directories

# windows
$env:PYTHONPATH="$env:PYTHONPATH;C:\Users\zprp\pdf-toolkit\src" # replace with your path
# unix
export PYTHONPATH=$PYTHONPATH:/path/to/src

You will also need to install poppler, because this library uses pdf2image, which is a poppler wrapper.
# windows
# download zip from https://github.com/oschwartz10612/poppler-windows/releases
# unpack and add /bin folder to path
# ubuntu
sudo apt install poppler-utils
# mac
brew install poppler

Install tesseract on your OS:
# ubuntu
sudo apt install tesseract-ocr
sudo apt install tesseract-ocr-pol # for the Polish language
# mac
brew install tesseract
brew install tesseract-lang # for the Polish language

To run on CUDA, you need to add torch to your environment:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Install the PDM package for easier library builds and installation:
pip install pdm

make install # install dependencies
make test # run tests
make build # build lib

""" For complex, nested data that is aligned and has a grid """
python src/toolkit data/complex/complex-example.pdf --algorithm=alligned-bordered --ocr=tesseract
""" For misaligned data without a grid, as in financial reports """
python src/toolkit data/financial/meta-financial-page5.pdf --bordered=False --algorithm=misalligned-borderless
""" For row-oriented data with a table grid """
python src/toolkit data/medical/hospitals_2017.pdf --first-page=1 --last-page=2 --algorithm=misalligned

pdm init # creates env
pdm add <package> # adds package
pdm add --dev pytest # adds dev package
pdm run pytest tests/ # runs tests
pdm build # builds the package
pip install dist/*.whl # installs the built package

pip install nox==2025.5.1
nox
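
Before running the tests or the command-line examples, it can help to confirm that the external tools from the installation steps are reachable. The sketch below is only a convenience check and is not part of the toolkit. It assumes the usual executable names, `pdftoppm` for poppler and `tesseract` for Tesseract; `find_missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumed executable names: poppler normally ships `pdftoppm`,
# and Tesseract ships `tesseract`. Adjust if your install differs.
REQUIRED_TOOLS = {"poppler": "pdftoppm", "tesseract": "tesseract"}


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("poppler and tesseract were found on PATH")
```

If a tool is reported missing on Windows, the most likely cause is that the unpacked `/bin` folder has not been added to the path.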