image extract error #2848

1339503169 · 2023-11-29T10:37:05Z

Please provide all mandatory information!

Describe the bug (mandatory)

You mentioned that images can be extracted from a PDF either with get_text('dict'), by keeping the blocks whose type == 1, or with get_images(). In a document I am working with, the two methods return different numbers of images. Is this normal? Or is there another way to determine the position on the page of the images returned by get_images()?

To Reproduce (mandatory)

import fitz  # PyMuPDF

document = fitz.open('./data/problem.pdf')
page = document.load_page(0)

# Image blocks in the text dictionary have type == 1
text_blocks = page.get_text('dict')['blocks']
img_blocks = [block for block in text_blocks if block['type'] == 1]

# Images referenced by the page
images = page.get_images()

print(len(img_blocks))
print(len(images))

problem.pdf

If applicable, add screenshots to help explain your problem.

Answered by JorjMcKie


JorjMcKie (Maintainer) · 2023-11-29T11:47:55Z

There is indeed a difference between the two approaches:

- Images extracted via get_text("dict") are internally restricted to a clip rectangle equal to the page itself: any image not completely contained in page.rect is omitted.
- Images reported via page.get_image_info() are not subject to this restriction.

To lift the restriction, pass clip=fitz.INFINITE_RECT() to get_text().
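
To make the difference concrete, here is a minimal sketch, not an official recipe. The helpers is_inside, split_by_containment and report are hypothetical names introduced for illustration, and the containment test is only an approximation of the clip rule described above. The PyMuPDF calls used are get_text("dict", clip=...), get_images(), get_image_info(xrefs=True) and fitz.INFINITE_RECT(). With xrefs=True, each entry from get_image_info() should carry both a bbox and an xref, which is one way to relate positions back to the entries of get_images().

```python
def is_inside(bbox, page_rect):
    """Return True if bbox (x0, y0, x1, y1) lies completely within page_rect."""
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_rect
    return px0 <= x0 and py0 <= y0 and x1 <= px1 and y1 <= py1


def split_by_containment(image_infos, page_rect):
    """Partition get_image_info()-style dicts by whether their bbox fits the page."""
    inside, outside = [], []
    for info in image_infos:
        target = inside if is_inside(info["bbox"], page_rect) else outside
        target.append(info)
    return inside, outside


def report(path, page_number=0):
    """Print image counts for one page (hypothetical helper, needs PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily so the helpers above run without it

    with fitz.open(path) as doc:
        page = doc.load_page(page_number)
        page_rect = tuple(page.rect)

        def image_blocks(**kwargs):
            # Image blocks in the text dictionary have type == 1
            blocks = page.get_text("dict", **kwargs)["blocks"]
            return [b for b in blocks if b["type"] == 1]

        print("get_text, default clip:", len(image_blocks()))
        print("get_text, infinite clip:",
              len(image_blocks(clip=fitz.INFINITE_RECT())))
        print("get_images:", len(page.get_images()))

        # Assumption: images whose bbox is not fully on the page are the
        # ones dropped by the default clip.
        infos = page.get_image_info(xrefs=True)
        _, outside = split_by_containment(infos, page_rect)
        for info in outside:
            print("not fully on page: xref", info["xref"], "bbox", info["bbox"])
```

Calling report('./data/problem.pdf') on the document from the question should show whether the missing image blocks correspond to images that extend beyond page.rect.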

1339503169 (Author) · 2023-11-30T01:26:46Z

It helps a lot.

1339503169 (Author) · 2023-11-30T01:27:01Z

Thank you.
