Unable to identify cropped region in images #3006

abe-mxff · 2024-01-10T10:03:10Z

Hi, I'm not able to detect the cropped region of images using the get_bboxlog() method (fitz version 1.23.7).

I generated the attached PDF with two cropped images (one rotated by 90°), but the extraction gives me the bounding boxes of the non-cropped images:

Image 0 - bbox: Rect(266.25, 157.2283935546875, 328.5, 608.3989868164062)
Image 1 - bbox: Rect(73.5, 73.5, 568.5, 142.5)
1 - Type: 'fill-path', width=595.5 height=842.25 (raw = (0.0, 0.0, 595.5, 842.25))
2 - Type: 'fill-image', width=62.25 height=451.17059326171875 (raw = (266.25, 157.2283935546875, 328.5, 608.3989868164062))
3 - Type: 'fill-image', width=495.0 height=69.0 (raw = (73.5, 73.5, 568.5, 142.5))

Below are the rendered PDF page and the script used to reproduce the result. What am I doing wrong?

import fitz

fn_in = "test_page.pdf"

# Open the document directly from its path
doc = fitz.open(fn_in)

page = doc.load_page(0)

# Extract images
imgs = []
for i, img in enumerate(page.get_image_info(xrefs=True)):
    xref = img["xref"]
    img["bbox"] = fitz.Rect(img["bbox"])
    print(f"Image {i} - bbox: {img['bbox']}")
    img["transform"] = fitz.Matrix(img["transform"])
    imgs.append(img)

# Get bbox_log
for i, (item_type, raw) in enumerate(page.get_bboxlog()):  # avoid shadowing the builtin `type`
    rect = fitz.Rect(raw)
    print(f"{i+1} - Type: '{item_type}', width={rect.width} height={rect.height} (raw = {raw})")

# There are three elements:
# 1) A rectangle occupying the full page (I don't know why it is there)
# 2) The first image
# 3) The second image (rotation is detected correctly)
# PROBLEM: none of the reported bboxes reflect the cropping

# In the rendered page, the images are cropped correctly
# page.get_pixmap().save('rendered_page.png')

Originally posted by @abe-mxff in #1312 (comment)

JorjMcKie (Maintainer) · 2024-01-10T10:44:55Z

This is a "Discussions" item - no bug, as far as we can see at the moment.

JorjMcKie (Maintainer) · 2024-01-10T10:47:46Z

Please provide an example file - not just an image.

abe-mxff (Author) · 2024-01-11T09:40:00Z

I might have found a (partial) solution, but some problems still remain.

The process I follow now is:

1. I extract images (with page.get_image_info(xrefs=True)) and drawings (with page.get_cdrawings(extended=True); the extended argument is what provides some information about the cropping).
2. I iterate over page.get_bboxlog() and retrieve each object by its type and its bounding box.
3. For each object, I compute its absolute position on the page.
4. Finally, I apply all the clip steps found in the drawings list.

The main problems I still have are:

1. Which is the proper way to obtain the relation between the bboxlog and the objects? (Two objects might have the same bounding box.)
2. Clipping applies only to some of the images, not to all of them, but I cannot see how to obtain that information from the drawings. I also cannot see the order in which clips and images are executed (neither from the level nor from the bboxlog order).

Original document: test_page_2.pdf

Problem (2) affects the output: note that the upside-down clipped text is missing.

The output was produced by the following script:

import fitz
import numpy as np
import cv2
from functools import reduce

fn_in = "test_page_2.pdf"

doc = fitz.open(fn_in)
page = doc.load_page(0)
bbox_log = page.get_bboxlog()

# Extract clips
clips = []
drawings = page.get_cdrawings(extended=True)
for drw in drawings:
    if drw["type"] != "clip" or drw["level"] != 0:
        continue
    clips.append(drw["scissor"])

# Extract images
images = page.get_image_info(xrefs=True)
# Enrich
for img in images:
    xref = img["xref"]
    pix = fitz.Pixmap(doc, xref)
    image = (
        np.frombuffer(pix.samples_mv, dtype=np.uint8).reshape((pix.height, pix.width, -1)).copy()
    )
    img["page"] = page.number
    img["image"] = image
    img["bbox"] = fitz.Rect(img["bbox"]) * page.rotation_matrix
    img["transform"] = fitz.Matrix(img["transform"]) * page.rotation_matrix
    img["page_rotation"] = page.rotation_matrix

# Recreate page
dpi_ocr = 300
pw, ph = [int(round(d / 72 * dpi_ocr)) for d in [page.rect.width, page.rect.height]]
page_img = np.full((ph, pw, 3), np.iinfo(np.uint8).max, dtype=np.uint8)
objs = []
for o_type, o_rect in bbox_log:
    j0, i0, j1, i1 = [d / 72 * dpi_ocr for d in o_rect]
    if o_type == "fill-path":
        for obj in [d for d in drawings if d["type"] == "f" and d["rect"] == o_rect]:
            # Clamp to the page, rounding outward (floor the start, ceil the end)
            i0, j0 = max(int(np.floor(i0)), 0), max(int(np.floor(j0)), 0)
            i1, j1 = min(int(np.ceil(i1)), ph), min(int(np.ceil(j1)), pw)
            image = np.stack(
                [
                    np.full(
                        (i1 - i0, j1 - j0),
                        int(round(channel * np.iinfo(np.uint8).max)),
                        dtype=np.uint8,
                    )
                    for channel in obj["fill"] + (obj["fill_opacity"],)
                ],
                axis=-1,
            )
            img = np.full((ph, pw, 4), np.iinfo(np.uint8).max, dtype=np.uint8)
            img[..., 3] = 0
            img[i0:i1, j0:j1, :] = image
            objs.append(img)

    if o_type == "fill-image":
        for obj in [i for i in images if i["bbox"] == o_rect]:
            image = obj["image"].copy()
            # Transform to rgba
            if image.ndim == 2 or image.shape[2] == 1:
                image = cv2.cvtColor(image[:, :], cv2.COLOR_GRAY2RGBA)
            elif image.shape[2] == 3:
                image = cv2.cvtColor(image, cv2.COLOR_RGB2RGBA)

            # Resample to the desired dpi if needed
            res_1 = max(image.shape[:2]) / (max(obj["bbox"].width, obj["bbox"].height) / 72)
            res_2 = min(image.shape[:2]) / (min(obj["bbox"].width, obj["bbox"].height) / 72)
            if abs(res_1 - res_2) > 1:
                dpi_img = int(round(res_1))
            else:
                dpi_img = int(round(0.5 * (res_1 + res_2)))
            if abs(dpi_ocr - dpi_img) > 1:
                h, w = [int(round(d / dpi_img * dpi_ocr)) for d in image.shape[:2]]
                ocr_image = cv2.resize(image, (w, h))
            else:
                h, w = image.shape[:2]
                ocr_image = image

            # Transform to page coordinates
            tr_px2pt = fitz.Matrix(1 / w, 0, 0, 1 / h, 0, 0) * obj["transform"]
            pt2px = dpi_ocr / 72  # (points) -> px
            tr_pt2px = fitz.Matrix(pt2px, 0, 0, pt2px, 0, 0)
            tr_px2px = tr_px2pt * tr_pt2px

            mat = np.array(
                [
                    [tr_px2px.a, tr_px2px.c, tr_px2px.e],
                    [tr_px2px.b, tr_px2px.d, tr_px2px.f],
                ],
                dtype=np.float32,
            )
            img = cv2.warpAffine(ocr_image, mat, (pw, ph))
            objs.append(img)

page_img = reduce(
    lambda a, b: a[..., :3] * (1 - b[:, :, 3][..., np.newaxis] / np.iinfo(np.uint8).max)
    + b[:, :, :3] * (b[:, :, 3][..., np.newaxis] / np.iinfo(np.uint8).max),
    objs,
    page_img,
).astype(np.uint8)

bw_page = cv2.cvtColor(page_img, cv2.COLOR_RGB2GRAY)

# Apply clipping: start fully masked (if there are clips), then unmask each clip rectangle
mask_white = np.full((ph, pw), len(clips) > 0, dtype=bool)
for rect in clips:
    j0, i0, j1, i1 = [d / 72 * dpi_ocr for d in rect]
    i0, j0 = max(int(np.floor(i0)), 0), max(int(np.floor(j0)), 0)
    i1, j1 = min(int(np.ceil(i1)), ph), min(int(np.ceil(j1)), pw)
    mask_white[i0:i1, j0:j1] = False

bw_page[mask_white] = np.iinfo(np.uint8).max
cv2.imwrite("page_image.png", bw_page)

JorjMcKie (Maintainer) · 2024-01-11T15:29:26Z

> Which is the proper way to obtain the relations between the bboxlog and the objects (two objects might have the same bounding box)

The methods get_bboxlog(), get_drawings() and get_text() all work exactly the same way for every supported document type (not just PDF).
Therefore, their objects cannot be identified unambiguously by an object number like the xref in PDF. Only their position in these lists can be used for identification.
If you will, the bboxlog is the master catalog of things on the page, and the sequence of its items equals the sequence in which the objects are painted.

For get_drawings(), the dictionary key "seqno" refers to the sequence number in bboxlog. If the type is "fs", the drawing has been merged from two originally separate, consecutive items in bboxlog: an "f" (fill-path) and an "s" (stroke-path) bbox. By the way, these two never have identical bboxes, because the stroke bbox is a little larger than (and thus contains) the fill bbox. The split exists because MuPDF's logic decomposes original drawings that stroke and fill a graphic at the same time.
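
The "seqno" link described above could be used roughly as follows. This is only a sketch under assumptions: it treats "seqno" as a 0-based index into the list returned by get_bboxlog(), and the helper name match_drawings_to_bboxlog is hypothetical (it is not part of PyMuPDF). It works on plain lists and dicts, so it can be tried without opening a document.

```python
def match_drawings_to_bboxlog(bboxlog, drawings):
    """Pair each drawing with the bboxlog entry its "seqno" points to.

    ASSUMPTION: "seqno" is a 0-based index into the bboxlog list.
    """
    matches = []
    for drw in drawings:
        seqno = drw.get("seqno")
        if seqno is None or not 0 <= seqno < len(bboxlog):
            continue  # skip entries without a usable sequence number
        entry = bboxlog[seqno]
        matches.append(
            {
                "seqno": seqno,
                "log_type": entry[0],         # e.g. "fill-path"
                "log_bbox": tuple(entry[1]),  # (x0, y0, x1, y1)
                "drawing": drw,
            }
        )
    return matches


# Hand-made sample data, shaped like the output quoted in the first post.
sample_log = [
    ("fill-path", (0.0, 0.0, 595.5, 842.25)),
    ("fill-image", (73.5, 73.5, 568.5, 142.5)),
]
sample_drawings = [{"type": "f", "seqno": 0, "rect": (0.0, 0.0, 595.5, 842.25)}]

for m in match_drawings_to_bboxlog(sample_log, sample_drawings):
    print(m["seqno"], m["log_type"], m["drawing"]["type"])
```

With a real page, the two inputs would be page.get_bboxlog() and page.get_drawings(). Matching by position in the list avoids comparing bounding boxes, which is ambiguous when two objects share one.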

If two text bboxes are fully equal (rare!), this usually happens because the creator intentionally simulates text effects such as boldness or text shading.
By looking at get_text() alone, there is no way to tell which one is which in such a case. But there is also page.get_texttrace(), a lower-level (text-only) extraction whose dictionaries also have a "seqno" key, like those of the vector graphics.
Method get_text(), on the other hand, can extract text and images in their original sequence.
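
To illustrate the texttrace remark, here is a minimal sketch that groups spans sharing the same bbox and lists their "seqno" values in ascending order. It assumes each dictionary from page.get_texttrace() carries "bbox" and "seqno" keys; the helper name find_overlaid_spans and the rounding tolerance are hypothetical choices made for this example.

```python
from collections import defaultdict


def find_overlaid_spans(trace, ndigits=2):
    """Return {bbox: [seqno, ...]} for bboxes used by more than one span.

    ASSUMPTION: every span dict has "bbox" and "seqno" keys.
    Coordinates are rounded so tiny float differences do not hide a match.
    """
    groups = defaultdict(list)
    for span in trace:
        key = tuple(round(v, ndigits) for v in span["bbox"])
        groups[key].append(span["seqno"])
    return {bbox: sorted(seqnos) for bbox, seqnos in groups.items() if len(seqnos) > 1}


# Hand-made sample: two spans painted at the same place (e.g. simulated bold).
sample_trace = [
    {"seqno": 3, "bbox": (72.0, 100.0, 200.0, 112.0)},
    {"seqno": 4, "bbox": (72.0, 100.0, 200.0, 112.0)},
    {"seqno": 7, "bbox": (72.0, 130.0, 180.0, 142.0)},
]
print(find_overlaid_spans(sample_trace))
```

With a real page, the input would be page.get_texttrace(). The sorted sequence numbers then tell which of the coinciding spans was painted first.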

Extraction of clipping information is currently supported only for vector graphics, not for text, images or shadings.
Also, only the clipping components are extracted, i.e. no effort is made to apply the resulting clipping area to the objects underneath it.
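
Because applying the clip is left to the caller, one possible do-it-yourself step is to intersect an object's bbox with the clip rectangles. The sketch below is an illustration, not PyMuPDF functionality: it assumes you already know which clip rectangles apply to the object (the thread does not answer that for images), it treats every clip as an axis-aligned rectangle like the "scissor" value used in the script above, and the helper names are hypothetical.

```python
def intersect(a, b):
    """Intersection of two (x0, y0, x1, y1) rectangles, or None if they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)


def visible_bbox(bbox, scissors):
    """Shrink bbox by every clip rectangle that is ASSUMED to apply to it.

    Nested clips are combined by intersection.
    Returns None when nothing of the object remains visible.
    """
    visible = tuple(bbox)
    for scissor in scissors:
        visible = intersect(visible, scissor)
        if visible is None:
            return None
    return visible


image_bbox = (73.5, 73.5, 568.5, 142.5)  # second image bbox from the first post
scissor = (100.0, 80.0, 300.0, 140.0)    # made-up clip rectangle, for illustration only
print(visible_bbox(image_bbox, [scissor]))
```

This only narrows a rectangle. A clip path that is not rectangular would need a real mask, as in the pixel-based script above.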
