Sequencing the annotations for extraction #1571
Replies: 3 comments 7 replies
-
You have all sorts of information available. Apart from annotations, these are
So there is no shortage here. The problem therefore is how many heuristics are needed to determine a page's structure. You could make a list of the blocks of text on the page. I am assuming that each of your annotations relates to some text, so you could take each annot and associate its rectangle with the rectangle of the text block that contains it. Maybe via a dictionary: the key is the text block's bbox tuple (which is hashable), the value is the (possibly empty) list of annotations referring to some of that block's text. This list could be sorted, too. Does that come close?
-
What I mean is this:

    import fitz  # PyMuPDF

    # `page` is an already loaded fitz.Page
    blocks = page.get_text("blocks")
    blocks_left = []
    blocks_right = []
    for b in blocks:
        bbox = fitz.Rect(b[:4])  # rect of the text block
        if bbox.x0 < page.rect.width / 2:  # text block's left border is left of middle of page
            blocks_left.append(bbox)
        else:
            blocks_right.append(bbox)  # block is on right half of page
    blocks_left.sort(key=lambda r: r.y1)
    blocks_right.sort(key=lambda r: r.y1)
    text_bboxes = blocks_left + blocks_right
    # we now have a list of text block rectangles
    # associate each text marker annotation with the text block, within which it marks some text:
    annot_dict = {}
    # this dict has the block rect as key and a list of annotations as value
    text_markers = (
        fitz.PDF_ANNOT_HIGHLIGHT,
        fitz.PDF_ANNOT_SQUIGGLY,
        fitz.PDF_ANNOT_UNDERLINE,
        fitz.PDF_ANNOT_STRIKE_OUT,
    )
    for annot in page.annots():
        if annot.type[0] not in text_markers:  # skip everything that is not a text marker
            continue
        for rect in text_bboxes:
            if annot.rect in rect:  # annot lies completely inside this text block
                annot_list = annot_dict.get(tuple(rect), [])
                annot_list.append(annot)
                annot_dict[tuple(rect)] = annot_list
    # Done
    # walk through annot_dict.keys() to get the sublist of text markers for the corresp. text block
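The final walk-through could be sketched roughly as follows. This is only an illustration, not part of PyMuPDF: the helper name `annots_in_reading_order` and its `rect_of` parameter are made up here. It assumes `text_bboxes` is already in the desired reading order (left column, then right column), that `annot_dict` is keyed by `tuple(rect)` as above, and that sorting inside a block by top edge and then left edge is good enough. Markers on the same line whose top edges differ slightly may need some tolerance.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def annots_in_reading_order(
    text_bboxes: Sequence[Sequence[float]],
    annot_dict: Dict[BBox, List[Any]],
    rect_of: Callable[[Any], Sequence[float]] = lambda a: a.rect,
) -> List[Any]:
    """Flatten annot_dict into one list that follows the order of text_bboxes.

    Hypothetical helper: rect_of returns an (x0, y0, x1, y1) sequence for an
    annotation; the default reads the annotation's .rect attribute.
    """
    ordered: List[Any] = []
    for bbox in text_bboxes:
        # blocks without any text marker simply contribute nothing
        annots = annot_dict.get(tuple(bbox), [])
        # inside one block: top-to-bottom first, then left-to-right
        ordered.extend(
            sorted(annots, key=lambda a: (rect_of(a)[1], rect_of(a)[0]))
        )
    return ordered
```

With the variables from the snippet above, `annots_in_reading_order(text_bboxes, annot_dict)` would return the text markers block by block, so the markers of one block stay together instead of being interleaved with those of a neighbouring block.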
-
Sure: just draw a corresponding rectangle:

    for b in blocks:  # draw a thin-lined red rect around each text block
        page.draw_rect(b[:4], color=(1, 0, 0), width=0.3)
    # then save it to a separate new PDF ...
-
I have documents often having boxes and images as shown ☝️
When I try to extract highlights, I sort them by using this viz
Now, my extracted note looks like: (Note the arrows)

I can understand that the sequence goes along x first and then moves down along y, and the extracted output is consistent with that.
I was wondering what to do if I wish to keep the sections separate, so that the flow of the text is followed and the arrow-marked items don't interfere with that flow. In other words, what should I do to get a somewhat better extraction?
Anything really. How can I make it work, if at all?
Will it require a lot of effort?
EDIT: If I don't employ the sorting as above, I get a slightly better result. But I would just like to learn the best practices of some kind :D
For example, whether PDF elements like rectangles (the boxes in the PDF) are identifiable, the way blocks are identifiable. Something like that.