Description
Title: Enhancement: Integration of pdfalto for improved PDF-to-ALTO conversion
What file formats are missing?
Our workflow involves many kinds of PDF documents. We can process them today, but the conversion is slow and takes manual effort. We have previously found that pdfalto (https://github.com/kermitt2/pdfalto) provides the structure and accuracy we need, but it is not integrated into the existing pipeline.
What converter should be added?
I suggest adding a dedicated converter based on pdfalto.
The goal is to streamline the conversion from PDF to the ALTO (Analyzed Layout and Text Object) XML format. Since pdfalto is specifically designed for this purpose, integrating it would significantly reduce the manual effort currently required to make PDFs compatible with our internal structure.
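To make the request more concrete, here is a minimal sketch of what a thin wrapper around the pdfalto command-line tool could look like. It assumes a `pdfalto` executable is available on `PATH` and that it accepts the input PDF followed by the output XML path as positional arguments; the function names `build_pdfalto_command` and `convert_pdf_to_alto` are hypothetical and not part of this project or of pdfalto.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfalto_command(pdf_path, xml_path, executable="pdfalto"):
    """Return the argument list for one run: input PDF, then output ALTO XML.

    Assumption: pdfalto takes the two paths as positional arguments.
    No optional flags are passed here.
    """
    return [executable, str(pdf_path), str(xml_path)]


def convert_pdf_to_alto(pdf_path, out_dir, executable="pdfalto"):
    """Convert one PDF to ALTO XML and return the path of the XML file."""
    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)

    # Fail early with a clear message if the external tool is not installed.
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")

    out_dir.mkdir(parents=True, exist_ok=True)
    xml_path = out_dir / (pdf_path.stem + ".xml")

    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(
        build_pdfalto_command(pdf_path, xml_path, executable),
        check=True,
    )
    return xml_path
```

A real converter in this project would need to plug into the existing converter interface; the sketch only shows the external call and the output naming.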
Are you willing to add it?
- [ ] Yes
- [x] No
(Note: I am reporting this as a requested improvement for the maintainers, as I am currently unable to implement the integration myself.)
Additional context
Using pdfalto would offer better support for OCR data and layout preservation compared to our current methods. You can find the tool and documentation here: https://github.com/kermitt2/pdfalto
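As an illustration of why layout preservation matters downstream, the sketch below reads text lines and their positions from an ALTO document using only the Python standard library. It relies on the ALTO element and attribute names `TextLine`, `String`, `CONTENT`, `HPOS`, and `VPOS`. The sample XML is hand-written for this example and is not actual pdfalto output, the namespace version in it is an assumption, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO schema versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one dict per TextLine with its position and joined word text."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Each word is a String element; its text lives in the CONTENT attribute.
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(
            {
                "hpos": float(elem.get("HPOS", 0)),
                "vpos": float(elem.get("VPOS", 0)),
                "text": " ".join(words),
            }
        )
    return lines


# Hand-written sample, not real pdfalto output.
SAMPLE = (
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">'
    "<Layout><Page><PrintSpace><TextBlock>"
    '<TextLine HPOS="10.0" VPOS="20.0">'
    '<String CONTENT="Hello"/><SP/><String CONTENT="world"/>'
    "</TextLine>"
    "</TextBlock></PrintSpace></Page></Layout>"
    "</alto>"
)

if __name__ == "__main__":
    for line in alto_lines(SAMPLE):
        print(line)
```

Because each line keeps its coordinates, a converter could reconstruct reading order or columns instead of working from a flat text dump.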