Sequencing the annotations for extraction #1571

dummifiedme · 2022-01-29T03:56:16Z

dummifiedme
Jan 29, 2022

I have documents often having boxes and images as shown ☝️

When I try to extract highlights, I sort them by using this viz

annots = [a for a in page.annots()]
annots.sort(key=lambda a: (a.rect.bl.y, a.rect.bl.x))
# now loop through the sorted list of annotations

Now, my extracted note looks like: (Note the arrows)

I can understand that the sequence is taking going along x first then moves down towards y. And the extracted output is consistent with it.

I was wondering, if I wish to keep the sections separate. The flow of text to be followed. And the arrow marked stuff shouldn't interfere int the flow. As in, what should I do if I wish to have a somewhat better extraction

with the text nearby OR
identifying sections OR
identify background colour?

Anything really. How can I make it work, if at all?
Will it require a lot of effort?

EDIT: If I don't employ the sorting as above, I get a slightly better result. But I would just like to learn the best practices of some kind :D
Whether the PDF elements like rectangle (the boxes in the PDF) are identifiable. Blocks are identifiable. Something like that.

JorjMcKie · 2022-01-29T06:17:57Z

JorjMcKie
Jan 29, 2022
Maintainer

You have all sorts of information available. Apart from annotations, these are

the locations of any images
the locations of blocks of text
all drawings with their coordinates. This would include rectangles and their fill colors, lines, curves.

So there is no shortage here. The problem therefore is, how much heuristics is needed to determine a page's structure.
If your population of pages is very diverse, then you obviously can have a very difficult task here.
On the other hand - a bit suggested by your example - if you always have a 2-column like structure, things are a lot easier.

You could make a list of blocks of text on the page: textblocks = page.get_text("blocks"). Every items therein start with the 4 floats that define the envelopping rectangle.
Then make a sublist of blocks that are located on the left side, e.g. by their x0 is smaller than page.rect.width/2.
Sort it ascending y1.
Then append the remaining blocks, also sorted like that.

I am assuming that your annotations are related to some text each. So you now could take each annot and associate its rectangle with the respective text block rectangle. Maybe via a dictionary: key is the text block's bbox tuple (which is hashable), value is the (possibly empty) list of annotations referring to some of that block's text. This list could be sorted, too.

Come close?

2 replies

dummifiedme Jan 29, 2022
Author

I am trying to work it out.

Not all my PDFs are two columned. Even the page I shared isn't exactly two columned all the way.

I was hoping we could somehow decide the boundaries based on the cluster of words.
eg. when we are going from (0,y) to (Xmax, y), it will look for breaks or gaps. to identify blocks.

You could make a list of blocks of text on the page: textblocks = page.get_text("blocks"). Every items therein start with the 4 floats that define the envelopping rectangle.

I am yet trying to use the get_text("block"). in that regard, is there a way I can impose a layer of my block rect data to see it visually?

PS. I am a novice at programming. I like to learn here and there, but overall a newbie :D
So please don't mind if I sound stupid.

dummifiedme Jan 29, 2022
Author

I tried comparing the "block" output with the PDF, and it seems to me that, if I am able to keep the extractions of adjacent blocks together, I may get a better result!

I don't know how to compare the rects or locations of the block though. I will try to look into the documentations

JorjMcKie · 2022-01-29T15:04:13Z

JorjMcKie
Jan 29, 2022
Maintainer

what I mean is this:

blocks = page.get_text("blocks")
blocks_left = []
blocks_right = []
for b in blocks:
    bbox = fitz.Rect(b[:4])  # rect of the text block
    if bbox.x0 < page.rect.width/2:  # text block's left border is left of middle of page
        blocks_left.append(bbox)
    else:
        blocks_right.append(bbox)  # block is on right half of page
blocks_left.sort(key=lambda r: r.y1)
blocks_right.sort(key=lambda r: r.y1)
text_bboxes = blocks_left + blocks_right
# we now have a list of text block rectangles
# associate each text marker annotation with the text block, within which it marks some text:
annot_dict = {}
# this dict has the block rect as key and a list of annotations as value
for annot in page.annots():
    if annot.type[0] not in (fitz.PDF_ANNOT_HIGHLIGHT, fitz.PDF_ANNOT_SQUIGGLY, fitz.PDF_ANNOT_UNDERLINE, fitz.PDF_ANNOT_STRIKE_OUT):
        continue
    for rect in text_boxes:
        if annot.rect in rect:
            annot_list = annot_dict.get(tuple(rect), [])
            annot.list.append(annot)
            annot_dict[tuple(rect)] = annot_list
# Done
# walk through annot_dict.keys() to get the sublist of text markers for the corresp. text block

4 replies

dummifiedme Jan 29, 2022
Author

I tried to incorporate this code.
I understand that it is trying to club left and right blocks together.
Then text_boxes is left + right combined.

What I don't understand is the later operations. What is annot_dict? What is its purpose?
Also, when I try to plot rects over the dict entries, I get only a few boxes and not all the blocks.

dummifiedme Jan 29, 2022
Author

This is the code I was trying with

import os
import fitz
import re

filelist = os.listdir()

for filename in filelist:
    if not filename.endswith(".pdf"):
        continue
    
    if not filename.startswith("2021"):
        continue


    doc = fitz.open(filename)
    print(doc.name)

    basename = os.path.splitext(filename)[0]


    # Name of the folder with the files for a pdf
    foldername = "& " + basename

    # create folder for the file selected if not already present
    if not os.path.isdir(foldername):
        os.makedirs(foldername)

    blockWord_Doc_Path = "./" + foldername + "/BLOCKS " + basename + ".md"
    blockWord_Doc = open(blockWord_Doc_Path, "w")

    for page in doc:

        if page.number in (9,10):
            # textlist = page.get_text("text")
            # wordlist = page.get_text("words")
            # blocklist= page.get_text("blocks")

        
            # blockWord_Doc.write("\n\n# Page {}\n".format(page.number))
            # blockWord_Doc.write("\n## Text List\n{}".format(textlist))
            # blockWord_Doc.write("\n## Word List\n{}".format(wordlist))
            # blockWord_Doc.write("\n## Block List\n{}".format(blocklist))
            blocks = page.get_text("blocks")
            bN = 0
            

            blocks_left = []
            blocks_right = []
            for b in blocks:
                bbox = fitz.Rect(b[:4])  # rect of the text block
                if bbox.x0 < page.rect.width/2:  # text block's left border is left of middle of page
                    blocks_left.append(bbox)
                else:
                    blocks_right.append(bbox)  # block is on right half of page
            blocks_left.sort(key=lambda r: r.y1)
            blocks_right.sort(key=lambda r: r.y1)
            text_boxes = blocks_left + blocks_right
            print(text_boxes)
            # we now have a list of text block rectangles
            # associate each text marker annotation with the text block, within which it marks some text:
            annot_dict = {}
            # this dict has the block rect as key and a list of annotations as value
            for annot in page.annots():
                if annot.type[0] not in (fitz.PDF_ANNOT_HIGHLIGHT, fitz.PDF_ANNOT_SQUIGGLY, fitz.PDF_ANNOT_UNDERLINE, fitz.PDF_ANNOT_STRIKE_OUT):
                    continue
                for rect in text_boxes:
                    if annot.rect in rect:
                        annot_list = annot_dict.get(tuple(rect), [])
                        annot_list.append(annot)
                        annot_dict[tuple(rect)] = annot_list
            # Done
            # walk through annot_dict.keys() to get the sublist of text markers for the corresp. text block
            print(annot_dict.keys())

            for rect in text_boxes:
                img=page.new_shape()
                # rect= block[0:4]
                point = rect[0:2]
                img.draw_rect(rect)
                img.insert_textbox(rect,str(bN) , fontsize=11, fontname='helv', color=(0,0,1))
                print(str(bN))
                img.finish(color=(0, 0, 1), fill=None, closePath=False)
                img.commit()
                bN+=1
            # for annot in page.annots():
            #     if annot.type[0] not in (fitz.PDF_ANNOT_HIGHLIGHT, fitz.PDF_ANNOT_SQUIGGLY, fitz.PDF_ANNOT_UNDERLINE, fitz.PDF_ANNOT_STRIKE_OUT):
            #         continue
                
            #     if 
                # blockWord_Doc.write()
    blockWord_Doc.close()
doc.save("XYZ" + basename + ".pdf")
doc.close()

gives this output

With rect in annot_dict.keys():

dummifiedme Jan 29, 2022
Author

The problem is I am unable to figure out a proper strategy.
I think best would be to look at the first block and keep storing the blocks starting at the same x0 level.

Then look for the blocks towards the right of it. say width/3 or width/4. And store and print them after the first collection.

dummifiedme Jan 29, 2022
Author

I think, it would be best to leave it as is. And fix the note later on manually :D

Another curious question though:
Can we identify the images or the rectangles drawn in the PDF?
Can we constrain the extractions or sort the blocks such that the ones inside a "rectangle or images" stay together...

JorjMcKie · 2022-01-29T15:07:23Z

JorjMcKie
Jan 29, 2022
Maintainer

is there a way I can impose a layer of my block rect data to see it visually?

Sure: just draw a corresponding rectangle:

for b in blocks:  # draw a thin-lined red rect around each text block
    page.draw_rect(b[:4], color=(1,0,0), width=0.3)
# then save it to a separate new PDF ...

1 reply

dummifiedme Jan 29, 2022
Author

I was just coming over to gallantly announce that I was able to atleast achieve this using docs and examples :p (little victories)

Sequencing the annotations for extraction #1571

Uh oh!

Uh oh!

dummifiedme Jan 29, 2022

Replies: 3 comments · 7 replies

Uh oh!

JorjMcKie Jan 29, 2022 Maintainer

Uh oh!

dummifiedme Jan 29, 2022 Author

Uh oh!

dummifiedme Jan 29, 2022 Author

Uh oh!

JorjMcKie Jan 29, 2022 Maintainer

Uh oh!

dummifiedme Jan 29, 2022 Author

Uh oh!

Uh oh!

dummifiedme Jan 29, 2022 Author

Uh oh!

dummifiedme Jan 29, 2022 Author

Uh oh!

Uh oh!

dummifiedme Jan 29, 2022 Author

Uh oh!

JorjMcKie Jan 29, 2022 Maintainer

Uh oh!

dummifiedme Jan 29, 2022 Author

dummifiedme
Jan 29, 2022

Replies: 3 comments 7 replies

JorjMcKie
Jan 29, 2022
Maintainer

dummifiedme Jan 29, 2022
Author

dummifiedme Jan 29, 2022
Author

JorjMcKie
Jan 29, 2022
Maintainer

dummifiedme Jan 29, 2022
Author

dummifiedme Jan 29, 2022
Author

dummifiedme Jan 29, 2022
Author

dummifiedme Jan 29, 2022
Author

JorjMcKie
Jan 29, 2022
Maintainer

dummifiedme Jan 29, 2022
Author