New Project: pypdfium2 #37

@mara004

Please describe the project

pypdfium2 is a Python library for high-level PDF operations (e.g. rendering, text extraction), based on Google's PDFium.

The project is relatively young, but has a high download count according to pepy.tech statistics (currently 7 million/month). It is used for PDF ingestion by some popular AI software (e.g. langchain, docling, dify).

We endeavor to support as many platforms as possible with pre-built wheel packages.

URL for the project
https://github.com/pypdfium2-team/pypdfium2
https://pypi.org/project/pypdfium2/
https://pypdfium2.readthedocs.io/en/stable/

Describe current CI/CD setup
GitHub Actions (?)

Describe the primary use case for the Github Action Runner
Building release binary wheels for ppc64le and s390x (manylinux and musllinux) through cibuildwheel.
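As a rough sketch of what such a run selects, the snippet below builds the cibuildwheel environment variables for one native runner. `CIBW_ARCHS_LINUX` and `CIBW_BUILD` are cibuildwheel options, and the selectors follow its `<python tag>-<platform>_<arch>` identifier scheme. The helper name `cibw_env` and the `cp312` tag are assumptions for illustration and are not taken from the project's cibw.yaml.

```python
# Hypothetical helper for illustration; not taken from pypdfium2's cibw.yaml.
LIBC_FLAVORS = ("manylinux", "musllinux")


def cibw_env(arch, python_tag="cp312"):
    """Build the cibuildwheel environment for one native runner architecture."""
    # One build selector per libc flavor, e.g. "cp312-manylinux_s390x".
    selectors = [f"{python_tag}-{libc}_{arch}" for libc in LIBC_FLAVORS]
    return {
        "CIBW_ARCHS_LINUX": arch,            # build only for the runner's own arch
        "CIBW_BUILD": " ".join(selectors),   # space-separated selector list
    }


for arch in ("ppc64le", "s390x"):
    print(arch, cibw_env(arch))
```

Each native runner would then only build the two wheels matching its own architecture.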

Paste a link to the existing actions workflow file(s) or directory
https://github.com/pypdfium2-team/pypdfium2/tree/main/.github/workflows/cibw.yaml
Maybe also https://github.com/pypdfium2-team/pypdfium2/tree/main/.github/workflows/sbuild_native.yaml once setup-python supports ppc64le and s390x, cf. actions/setup-python#1154.

How often do you plan on executing the runner?
On release, and for testing as-needed.

What is the primary programming language for the project?
Python, with a C/C++ binary extension (ctypes-based).

Please select desired hardware

  • Power 9 (ppc64le)
  • IBM Z / LinuxONE (s390x)

Account names of the GitHub repo admins that will need access to setting up the runner
@mara004

Further Notes

For most platforms, pypdfium2 just repacks bblanchon's cross-compiled pdfium-binaries, but ppc64le/s390x are not handled in Google's toolchain yet.
A ppc64le downstream figured out that it can theoretically be done, but it needs patches plus a self-built sysroot and libclang_rt builtins.

As this would require upstream changes at pdfium-binaries/pdfium, it is not clear if/when that approach could be taken, whereas building in a native runner with cibuildwheel is something we could do today from the pypdfium2 side.

In theory, we might be able to use emulation, but it is wastefully slow, e.g. 1h30 emulated vs. 4-6 min native. If we need to build tests or vendor more dependencies (e.g. ICU for smaller wheel sizes, or libc++ for ABI safety), emulation becomes even less feasible.
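To put the timings above in proportion, here is the arithmetic as a small sketch. The function name `slowdown_range` is made up for this example; the inputs are just the figures quoted above (1h30 emulated, 4 to 6 minutes native).

```python
def slowdown_range(emulated_min, native_min_fast, native_min_slow):
    """Return (lowest, highest) slowdown factor of emulation vs. native."""
    # The slowest native run gives the smallest factor, the fastest the largest.
    return emulated_min / native_min_slow, emulated_min / native_min_fast


low, high = slowdown_range(emulated_min=90, native_min_fast=4, native_min_slow=6)
print(f"emulation is roughly {low:g}x to {high:g}x slower")
```

So with these numbers, emulation costs somewhere between 15 and 22.5 times the native build time, before any additional dependencies are built.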

Disclaimer

I'm a Python developer, not a compiled language expert, so we depend on a good basis from upstream.

pdfium's Readme acknowledges the possibility of endianness bugs (may affect s390x).
Also, the build strategy we are using (native tools & libraries) is not documented or officially supported upstream. However, the toolchain-based build they are suggesting is not portable.
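For readers unfamiliar with the endianness concern: s390x is big-endian, while ppc64le and x86_64 are little-endian, so code that reinterprets raw bytes as native integers can behave differently. The snippet below is only a generic illustration of that effect using the Python standard library; it does not reproduce any known pdfium bug.

```python
import sys

raw = b"%PDF"  # the first four bytes of a PDF file

# The same four bytes yield different integers depending on byte order.
as_big = int.from_bytes(raw, "big")        # how a big-endian host (s390x) reads them
as_little = int.from_bytes(raw, "little")  # how a little-endian host (ppc64le) reads them

print(f"host byte order: {sys.byteorder}")
print(f"big-endian:    0x{as_big:08x}")
print(f"little-endian: 0x{as_little:08x}")
```

Code that hard-codes one of these two interpretations works on one class of hosts and silently misreads data on the other.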
