
[Converter Request] (PDF to alto/xml) #482

@jdijk-deventit


Title: Enhancement: Integration of pdfalto for improved PDF-to-ALTO conversion

What file formats are missing?
The missing format is ALTO XML as an output target for PDF input. Our workflow handles various types of PDF documents, and while we can process them, the current conversion is inefficient and time-consuming. We have found that pdfalto (https://github.com/kermitt2/pdfalto) provides the structure and accuracy we need, but it is not integrated into the existing pipeline.

What converter should be added?
I suggest adding a dedicated converter based on pdfalto.
The goal is to streamline the conversion from PDF to the ALTO (Analyzed Layout and Text Object) XML format. Since pdfalto is specifically designed for this purpose, integrating it would significantly reduce the manual effort currently required to make PDFs compatible with our internal structure.
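
To make the request more concrete, here is a minimal sketch of what such a converter wrapper could look like. It is not an implementation for this project: the function names `build_pdfalto_command` and `convert_pdf_to_alto` are hypothetical, the sketch assumes a `pdfalto` binary is already installed and available on `PATH`, and it uses only the basic invocation form (`pdfalto <PDF-file> <xml-file>`) without any optional flags.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfalto_command(pdf_path, xml_path, binary="pdfalto"):
    """Return the argument list for a basic pdfalto run (PDF in, ALTO XML out).

    Hypothetical helper: only the plain `pdfalto <pdf> <xml>` form is used.
    """
    return [binary, str(pdf_path), str(xml_path)]


def convert_pdf_to_alto(pdf_path, out_dir, binary="pdfalto"):
    """Convert one PDF to ALTO XML and return the path of the XML file.

    Assumes the pdfalto binary is installed separately and is on PATH.
    """
    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)

    # Fail early with a clear error if the external tool is missing.
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")

    out_dir.mkdir(parents=True, exist_ok=True)
    xml_path = out_dir / (pdf_path.stem + ".xml")

    # check=True raises CalledProcessError if pdfalto exits with a non-zero status.
    subprocess.run(build_pdfalto_command(pdf_path, xml_path, binary), check=True)
    return xml_path
```

A call such as `convert_pdf_to_alto("scan.pdf", "out")` would then be expected to write `out/scan.xml`, provided the binary is present and the PDF can be read.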

Are you willing to add it?

  • Yes
  • No
    (Note: I am reporting this as a requested improvement for the maintainers, as I am currently unable to implement the integration myself.)

Additional context
Using pdfalto would offer better support for OCR data and layout preservation compared to our current methods. You can find the tool and documentation here: https://github.com/kermitt2/pdfalto
