Help trying to get Text before Picture and use as context for the vision model... #1317
-
@dolfim-ibm Do you have any insight on this?
-
+1 This is a super valid use case!
-
@dolfim-ibm I was wondering if this could be baked into the PictureDescription capability itself. WDYT?
-
Hi folks, I added support for extracting the text surrounding a Picture and using it to prompt the VLM. Check out this branch: https://github.com/rafaeltuelho/docling/tree/picture-vlm-context-aware. I also added a new pytest that exercises this capability: https://github.com/rafaeltuelho/docling/blob/picture-vlm-context-aware/tests/test_picture_description.py You can run this specific test scenario with
This test uses a public financial report PDF as the document source, which contains a real bar chart picture. But you can change the
Let me know what you think.
-
When generating a textual description of a Picture, I found it would be useful to take the text right before the Picture and pass it to the model as context alongside the Picture itself. Asking a VLM to describe the picture without any additional context is not helpful, because the model gives a generic description that may not make sense in the context of the document the picture comes from.
I'm trying to implement a function to get the `TextItem`s immediately before a Picture. Here is what I'm trying to do
The problem is that `docling_document.groups` returns a list of `RefItem`. How do I go from a `RefItem` instance to the actual Item (Text, Picture, Table)?
I would appreciate it if someone on the team could shed light on helping me accomplish this.
Thanks.
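For anyone landing here with the same question, the idea of following a reference to the item it names can be sketched without docling at all. The example below is a minimal sketch and not docling's implementation: it assumes the document has been exported to a plain dict in which references look like `{"$ref": "#/texts/0"}`, and `resolve_ref`, `iter_items`, `text_before_pictures` and the sample data are all hypothetical names made up for illustration. In docling-core itself, `RefItem` appears to offer a `resolve(doc)` method and `DoclingDocument` an `iterate_items()` generator that walks items in reading order; verify both against the version you have installed before relying on them.

```python
from collections import deque

# Hypothetical sample data: a plain-dict stand-in for an exported document.
# References use the JSON-pointer style "#/<collection>/<index>".
doc = {
    "body": {"children": [
        {"$ref": "#/texts/0"},
        {"$ref": "#/texts/1"},
        {"$ref": "#/pictures/0"},
        {"$ref": "#/texts/2"},
    ]},
    "texts": [
        {"label": "section_header", "text": "Revenue by quarter"},
        {"label": "text", "text": "Revenue grew in every quarter of the year."},
        {"label": "caption", "text": "Figure 1"},
    ],
    "pictures": [{"label": "picture"}],
}


def resolve_ref(document, cref):
    """Follow a '#/collection/index' pointer to the item it names."""
    prefix, collection, index = cref.split("/")
    if prefix != "#":
        raise ValueError(f"unsupported reference: {cref!r}")
    return document[collection][int(index)]


def iter_items(document, node):
    """Yield (cref, item) pairs depth-first, in the order children are listed."""
    for child in node.get("children", []):
        cref = child["$ref"]
        item = resolve_ref(document, cref)
        yield cref, item
        # Groups (and other containers) carry their own children.
        yield from iter_items(document, item)


def text_before_pictures(document, max_items=2):
    """For each picture, collect up to max_items most recent preceding texts."""
    recent = deque(maxlen=max_items)  # keeps only the latest max_items texts
    contexts = []
    for cref, item in iter_items(document, document["body"]):
        if cref.startswith("#/pictures/"):
            contexts.append(list(recent))
        elif cref.startswith("#/texts/"):
            recent.append(item["text"])
    return contexts


contexts = text_before_pictures(doc)
```

Each entry in `contexts` lines up with one picture and can be joined into the prompt sent to the vision model. The sketch takes the most recent text items regardless of what sits between them and the picture; resetting `recent` when a table or another picture is encountered would be a stricter reading of "immediately before".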