Get page number and bounding box when dealing with docx #997
Hi everyone, I built a chunking pipeline using Docling's HybridChunker, and I need the page number and bounding box information for my chunks. Unfortunately, I can't collect these for .DOCX files. Do you have a solution, or is it not possible for this document type? Thank you for your consideration.
Replies: 2 comments
@shubham-wysa You will not be able to obtain bounding box and page information for `.docx` files, since internally `.docx` files do not track this information; they are simply a tree of text elements. Page information results from rendering `.docx` files through a viewer (e.g., MS Word). If you require this information, you should convert to `pdf` before ingestion.
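
For illustration, here is a rough sketch of that convert-then-ingest route. It assumes LibreOffice is installed and its `soffice` binary is on the PATH (any other DOCX-to-PDF renderer would do), and that chunks from `HybridChunker` expose provenance as `chunk.meta.doc_items[*].prov[*]` with `page_no` and `bbox` fields, which may differ between Docling versions. The helper names (`soffice_command`, `docx_to_pdf`, `chunk_provenance`, `chunks_with_provenance`) are made up for this example and are not part of Docling.

```python
import shutil
import subprocess
from pathlib import Path


def soffice_command(docx_path, out_dir):
    """Build a headless LibreOffice command line for DOCX -> PDF conversion."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def docx_to_pdf(docx_path, out_dir):
    """Render the DOCX to PDF and return the expected output path."""
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    subprocess.run(soffice_command(docx_path, out_dir), check=True)
    # LibreOffice writes <stem>.pdf into out_dir.
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")


def chunk_provenance(chunk):
    """Collect (page_no, (l, t, r, b)) for every item a chunk is built from.

    Assumes the provenance layout described above; adjust the attribute
    names if your Docling version differs.
    """
    found = []
    for item in chunk.meta.doc_items:
        for prov in item.prov:
            box = prov.bbox
            found.append((prov.page_no, (box.l, box.t, box.r, box.b)))
    return found


def chunks_with_provenance(pdf_path):
    """Convert the PDF with Docling and pair each chunk's text with its provenance."""
    # Imported lazily so the helpers above can be used without Docling installed.
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(str(pdf_path)).document
    chunker = HybridChunker()
    return [(c.text, chunk_provenance(c)) for c in chunker.chunk(dl_doc=doc)]
```

Usage would look something like `chunks_with_provenance(docx_to_pdf("report.docx", "out"))`. Keep in mind that the page numbers and boxes then describe the PDF rendering, so they can shift depending on which renderer and fonts were used for the conversion.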