Help trying to get Text before Picture and use as context for the vision model... #1317

rafaeltuelho · 2025-04-07T17:30:10Z

rafaeltuelho
Apr 7, 2025

When generating a textual description of a Picture, I found it would be useful to take the text right before the Picture and pass it to the model as context, alongside the Picture itself. Asking a VLM to describe the picture without that context is not helpful: the model gives a generic description that may not make sense for the document the picture comes from.

I'm trying to implement a function to get the TextItems immediately before a Picture. Here is what I have so far:

from docling_core.types.doc import PictureItem, TextItem

def get_text_items_before_picture(docling_document, picture_ref):
    text_items_before_picture = []
    found_picture = False

    # Iterate through the groups in the document
    for group in docling_document.groups:
        # reset the text_items_before_picture for each group
        text_items_before_picture = []
        for item in group.children:
            if isinstance(item, PictureItem) and item.get_ref().cref == picture_ref:
                # Stop searching once the Picture is found
                found_picture = True
                break
            elif isinstance(item, TextItem):
                # Collect TextItems until the Picture is found
                text_items_before_picture.append(item)

        if found_picture:
            # Return the collected TextItems for the group containing the Picture
            return text_items_before_picture

    # If no Picture with the given ref is found, return an empty list
    return []

The problem is that the children of each group in docling_document.groups are RefItem instances, not the items themselves. How do I go from a RefItem instance to the actual item (Text, Picture, Table)?

I would appreciate it if someone on the team could shed light on helping me accomplish this.

Thanks.

rafaeltuelho · 2025-04-08T14:06:05Z

rafaeltuelho
Apr 8, 2025
Author

@dolfim-ibm Do you have any insight on this?
Thanks.

5 replies

dolfim-ibm Apr 11, 2025
Maintainer

Are you trying to get to the caption of the figure or simply the paragraph before?

Captions are already referenced, and you can access them via PictureItem.captions.

If you want to find the document item before a given cref, I think we should first extend DoclingDocument to support it. The prototype above should also use doc.iterate_items().
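
Both PictureItem.captions and group.children hold references rather than the items themselves, so the original question of turning a RefItem into the actual item applies here too. As far as I know, docling-core's RefItem provides resolve(doc) for this. The sketch below only illustrates the idea on a stand-in document, assuming references have the form #/texts/1 or #/body. The resolve_cref helper and the stand-in doc are hypothetical and not part of docling.

```python
from types import SimpleNamespace
from typing import Any


def resolve_cref(doc: Any, cref: str) -> Any:
    """Follow a reference such as '#/texts/1' or '#/body' to the object it names."""
    parts = cref.split("/")  # '#/texts/1' -> ['#', 'texts', '1']
    if len(parts) == 3:
        _, collection, index = parts
        return getattr(doc, collection)[int(index)]
    if len(parts) == 2:
        return getattr(doc, parts[1])
    raise ValueError(f"unsupported reference: {cref!r}")


# Stand-in document (hypothetical data, not a real DoclingDocument).
# In docling the captions list holds RefItem objects with a .cref string;
# plain strings are used here to keep the sketch self-contained.
doc = SimpleNamespace(
    texts=[
        SimpleNamespace(text="Intro paragraph."),
        SimpleNamespace(text="Figure 1: revenue by region."),
    ],
    pictures=[SimpleNamespace(captions=["#/texts/1"])],
)

picture = resolve_cref(doc, "#/pictures/0")
caption = resolve_cref(doc, picture.captions[0]).text
```

With a real document, the equivalent would presumably be ref.resolve(doc) on each entry of group.children or picture.captions, followed by the isinstance checks from the prototype.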

simonschoe Apr 11, 2025

Just to note: depending on the structure and layout of the document, it might be equally reasonable to also consider the next layout element after this picture (or the next n elements for that matter). The idea would be to flexibly capture the textual context surrounding the picture to generate a more meaningful picture description than simply considering the image itself.

dolfim-ibm Apr 11, 2025
Maintainer

The best option, once cross-references are implemented, would be to use the actual paragraphs that reference the figure.

simonschoe Apr 11, 2025

Well, I guess you will frequently have the situation that paragraphs do not explicitly reference a certain figure or, alternatively, reference it in any number of different ways. For example, you might have one paragraph referencing "figure 2" but two more paragraphs dealing with the picture's content without explicitly citing it (if that is what you mean by "referencing").

Ideally, you could give the user the flexibility to decide how many preceding and trailing layout elements are picked up for the picture description?
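
A rough sketch of such a configurable window, assuming the items arrive in reading order as (item, level) tuples, which is what doc.iterate_items() yields, and that each item exposes self_ref and, for text items, a text attribute. The function name surrounding_text and the n_before / n_after parameters are made up for illustration. It is written against a plain iterable so it can be tried without a parsed document.

```python
from types import SimpleNamespace as Item
from typing import Any, Iterable, List, Tuple


def surrounding_text(
    items: Iterable[Tuple[Any, int]],
    picture_ref: str,
    n_before: int = 1,
    n_after: int = 1,
) -> Tuple[List[str], List[str]]:
    """Return (texts before, texts after) the picture identified by picture_ref."""
    before: List[str] = []
    after: List[str] = []
    found = False
    for item, _level in items:
        if found and len(after) >= n_after:
            break  # trailing window is full, stop scanning
        if not found and getattr(item, "self_ref", None) == picture_ref:
            found = True
            continue
        text = getattr(item, "text", None)
        if not (isinstance(text, str) and text.strip()):
            continue  # skip tables, other pictures, empty text
        (after if found else before).append(text)
    if not found:
        return [], []
    # Keep only the last n_before texts seen ahead of the picture.
    return before[max(len(before) - n_before, 0):], after


# Stand-in stream (hypothetical data); with docling, pass doc.iterate_items().
stream = [
    (Item(self_ref="#/texts/0", text="Revenue grew in 2024."), 1),
    (Item(self_ref="#/texts/1", text="Figure 2 breaks it down by region."), 1),
    (Item(self_ref="#/pictures/0"), 1),
    (Item(self_ref="#/texts/2", text="Costs stayed flat."), 1),
    (Item(self_ref="#/texts/3", text="Outlook for 2025."), 1),
]
before, after = surrounding_text(stream, "#/pictures/0", n_before=1, n_after=2)
```

Counting layout elements rather than characters keeps the window aligned with the document structure, but the resulting context can vary a lot in length, so a character or token cap on top may be worth considering.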

rafaeltuelho Apr 15, 2025
Author

All valid concerns! Totally agree.
I started implementing a Proof of Concept here and will soon share some results.

simonschoe · 2025-04-09T05:47:33Z

simonschoe
Apr 9, 2025

+1

This is a super valid use case!

0 replies

rafaeltuelho · 2025-04-10T17:06:59Z

rafaeltuelho
Apr 10, 2025
Author

@dolfim-ibm I was wondering if this could be baked into the PictureDescription capability itself. WDYT?

1 reply

dolfim-ibm Apr 11, 2025
Maintainer

It could be interesting. Are you thinking it could be useful in the prompt? Do you have some example prompt already in mind?

One idea would be to expose {caption} and maybe {text_before} (or another name) in the prompt definition.
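
A minimal sketch of how such placeholders could be filled, using plain str.format. The template wording, the build_prompt helper, and the "n/a" fallback are assumptions for illustration, not docling's prompt handling. Only the {caption} and {text_before} names come from the suggestion above.

```python
PROMPT_TEMPLATE = (
    "Describe the picture in a few sentences.\n"
    "Caption: {caption}\n"
    "Text before the picture: {text_before}\n"
)


def build_prompt(template: str, caption: str = "", text_before: str = "") -> str:
    """Fill the optional {caption} and {text_before} placeholders in a prompt."""
    # str.format ignores unused keyword arguments, so a prompt without
    # placeholders passes through unchanged.
    return template.format(
        caption=caption or "n/a",
        text_before=text_before or "n/a",
    )


prompt = build_prompt(
    PROMPT_TEMPLATE,
    caption="Figure 2: revenue by region",
    text_before="Revenue grew in 2024.",
)
plain = build_prompt("Describe this image.", caption="ignored")
```

One caveat with this approach: a prompt containing literal braces would have to escape them as {{ and }}, otherwise str.format raises an error.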

rafaeltuelho · 2025-04-24T23:31:29Z

rafaeltuelho
Apr 24, 2025
Author

Hi folks, I added support for extracting text surrounding the Picture and using it to prompt the VLM.

Check out this branch: https://github.com/rafaeltuelho/docling/tree/picture-vlm-context-aware

I also added a new pytest test that exercises this capability: https://github.com/rafaeltuelho/docling/blob/picture-vlm-context-aware/tests/test_picture_description.py

You can run this specific test scenario with:

pytest --log-cli-level=INFO tests/test_picture_description.py -k "not api"

Note: you may need to run pip install accelerate first.

The test uses a public financial report PDF containing a real bar chart as its document source. You can change the DOC_SOURCE declared at the top of the test file to try a different PDF.

Let me know what you think.

3 replies

simonschoe May 1, 2025

@rafaeltuelho This is exactly what I was hoping for: main...rafaeltuelho:docling:picture-vlm-context-aware#diff-7232dd3301f57d04fb109055164b830ff6455b4515740737b66f270f5136da8fR215-R220

Any chance this will/could be integrated into the main codebase?

rafaeltuelho May 6, 2025
Author

Thanks for looking at this and providing feedback, @simonschoe.
I will go ahead and prepare a PR.

rafaeltuelho May 13, 2025
Author

I created a draft PR here #1587
