Behaviour of document.save_as_json() with referenced mode of images #399

@vku-ibm

Description

Currently, save_as_json(filename, image_mode=ImageRefMode.REFERENCED) behaves as follows:

  1. If filename is a relative path:
  • images are saved into [filename.stem]_artifacts folder next to the json file
  • the references to images are updated in the docling document and point to the relative path
  2. If filename is an absolute path:
  • images are saved into [filename.stem]_artifacts folder next to the json file, as before
  • the references to images are updated in the docling document and point to the absolute path
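
The two cases above can be modeled with a few lines of plain Python. This is only a sketch of the behavior as described in this issue, not docling's actual implementation: the helper name `model_image_reference` and the image name `image_0.png` are hypothetical, and the exact form of the relative reference is an assumption.

```python
from pathlib import PurePosixPath


def model_image_reference(filename: str, image_name: str) -> str:
    """Model the reference described above; not docling's actual code."""
    json_path = PurePosixPath(filename)
    # Artifacts folder sits next to the JSON file: [filename.stem]_artifacts
    artifacts_dir = json_path.parent / f"{json_path.stem}_artifacts"
    # Assumption: the reference inherits the relative/absolute form of `filename`.
    return str(artifacts_dir / image_name)


# Relative filename: the reference stays relative.
print(model_image_reference("out/report.json", "image_0.png"))
# Absolute filename: the reference becomes absolute.
print(model_image_reference("/tmp/work/report.json", "image_0.png"))
```

In this model the caller has no way to pick the form of the reference independently of the form of filename, which is the limitation this issue is about.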

This behavior is consistent but undesirable in several use cases:

  1. When running docling inside a container, or a similar environment where we must control exactly where temporary files go, filename will very likely have to be an absolute path, yet the references to the images should stay relative. This is currently not possible.
  2. The way referenced images are saved and named might require an additional level of customization by the user. For example, when storing conversion results on S3, one may prefer to save all images under a single prefix, because the document is converted into multiple formats that are themselves stored under different prefixes, so storing images under ..._artifacts is not an efficient option. Images might also have to be renamed when saved, following a different naming schema, but because save_as_json now also updates the references itself, we can't change the image names in the references manually inside the docling document object before serializing it.
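
To make the two requests concrete, here is a minimal sketch of the desired outcome using only the standard library. It does not use or propose any docling API: `relative_reference` and `shared_prefix_name` are hypothetical helpers, and the paths and naming schema are made up for illustration.

```python
import posixpath


def relative_reference(json_file: str, image_file: str) -> str:
    """Express an image path relative to the folder holding the JSON file."""
    return posixpath.relpath(image_file, start=posixpath.dirname(json_file))


def shared_prefix_name(doc_stem: str, index: int, suffix: str = ".png") -> str:
    """Hypothetical naming schema for images stored under one shared prefix."""
    return f"{doc_stem}_{index:03d}{suffix}"


# Use case 1: absolute filename, but the stored reference stays relative.
print(relative_reference("/tmp/work/report.json",
                         "/tmp/work/report_artifacts/image_0.png"))

# Use case 2: all images under a single prefix, with custom names.
image_file = posixpath.join("/data/out/images", shared_prefix_name("report", 0))
print(relative_reference("/data/out/json/report.json", image_file))
```

Both helpers only compute strings; the point is that the image location, the image name, and the form of the reference would ideally be chosen by the caller rather than derived from filename.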

Metadata

Labels: enhancement (New feature or request)
