Obtain the layer of each element in the PDF #2986

more-strive · 2024-01-08T02:37:31Z

more-strive
Jan 8, 2024

Is your feature request related to a problem? Please describe.
I have an image editing application:
Demo: https://yft.design
GitHub: https://github.com/dromara/yft-design
When importing a PDF, I would like to obtain the layer (stacking order) of each element (image, path, text). At the moment, the position, content, and size of all elements can already be resolved:

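def get_elements(page):
    # Vector graphics (paths); each item is a dict with a "rect" entry
    draws = page.get_drawings()
    # get_text('dict') returns a dict; the text blocks are under "blocks"
    texts = page.get_text('dict', flags=11)["blocks"]
    # Image metadata, including bbox, hash digest, and xref
    images = page.get_image_info(hashes=True, xrefs=True)
    return [*draws, *images, *texts]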


JorjMcKie · 2024-01-08T07:53:15Z

JorjMcKie
Jan 8, 2024
Maintainer

It seems that everything you want is already implemented.
You can extract images with all their metadata, and the same is true for vector graphics and text.
Please be more specific about what you are missing.


more-strive · 2024-01-08T08:07:08Z

more-strive
Jan 8, 2024
Author

The square, trapezoid, and other paths in the middle of the image are parsed, but they end up covered by an image. Sorry, I didn't mark them.

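def get_elements(page):
    # Vector graphics (paths); each item is a dict with a "rect" entry
    draws = page.get_drawings()
    # get_text('dict') returns a dict; the text blocks are under "blocks"
    texts = page.get_text('dict', flags=11)["blocks"]
    # Image metadata, including bbox, hash digest, and xref
    images = page.get_image_info(hashes=True, xrefs=True)
    return [*draws, *images, *texts]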

The code here places the image elements after the paths. Is there any way to redefine their index?


JorjMcKie · 2024-01-08T08:34:54Z

JorjMcKie
Jan 8, 2024
Maintainer

This is a typical "Discussions" post. Let me move this accordingly.

2 replies

more-strive Jan 8, 2024
Author

The path element has been parsed, but it is covered by an image

JorjMcKie · 2024-01-08T08:46:33Z

JorjMcKie
Jan 8, 2024
Maintainer

This cannot be changed. You could delete the image of course and then re-insert it before everything else on the page.

3 replies

more-strive Jan 8, 2024
Author

Is there a way to find out the index of the image?

JorjMcKie Jan 8, 2024
Maintainer

Yes: the method Page.get_text("dict") extracts text and images when using the default flags.
The extracted image and text blocks appear in the same sequence as in the page's /Contents.

The full sequence of the bounding boxes of everything on the page is given by the list that page.get_bboxlog() returns. The items in this list look like (obj-type, bbox).
So you can take the bbox of an image or some text and then determine the index of the bboxlog entry that contains it.
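
As a rough illustration of the lookup described above, the following sketch searches a bboxlog-style list for the first entry of a given kind whose bbox encloses a given rectangle. The helper names (contains, layer_index), the KINDS mapping, and the 1-point tolerance are assumptions made for this example and are not part of PyMuPDF. The sample list is hand-written in the (obj-type, bbox) shape mentioned above, and the type strings should be checked against what page.get_bboxlog() actually returns for your file.

```python
# Assumed mapping from element kind to bboxlog type strings; verify on real output.
KINDS = {
    "path": {"fill-path", "stroke-path"},
    "image": {"fill-image"},
    "text": {"fill-text", "stroke-text"},
}


def contains(outer, inner, tol=1.0):
    """Return True if rectangle `outer` encloses `inner`, with `tol` points of slack."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return (ox0 - tol <= ix0 and oy0 - tol <= iy0
            and ox1 + tol >= ix1 and oy1 + tol >= iy1)


def layer_index(bboxlog, bbox, kind, tol=1.0):
    """Index of the first bboxlog entry of `kind` whose bbox encloses `bbox`, else None."""
    for i, (obj_type, logged) in enumerate(bboxlog):
        if obj_type in KINDS[kind] and contains(logged, bbox, tol):
            return i
    return None


# Hand-written stand-in for page.get_bboxlog(); later entries are drawn later.
sample_log = [
    ("fill-path", (100.0, 100.0, 200.0, 200.0)),
    ("fill-image", (50.0, 50.0, 400.0, 400.0)),
    ("fill-text", (60.0, 420.0, 300.0, 440.0)),
]

path_idx = layer_index(sample_log, (100.0, 100.0, 200.0, 200.0), "path")
image_idx = layer_index(sample_log, (50.0, 50.0, 400.0, 400.0), "image")
# A lower index means the object is drawn earlier, so the image here covers the path.
print(path_idx, image_idx)
```

Restricting the search to one kind keeps a small path or text span from being matched to a larger image that happens to enclose it.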

Answer selected by more-strive

more-strive Jan 8, 2024
Author

This interface only obtains the indexes of text and images. How should the index of a path be compared with those of text and images?

JorjMcKie · 2024-01-08T09:32:24Z

JorjMcKie
Jan 8, 2024
Maintainer

> This interface only obtains the indexes of text and images. How should the index of a path be compared with those of text and images?

That is what I was referring to: there is no way to extract every object type with one single method.
So you must take the bboxes of text, images, and graphics and look up their positions in the bboxlog list.
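
One possible way to put this together is sketched below: collect the bboxes of paths, images, and text blocks, match each one to the best-overlapping bboxlog entry of the same kind, and sort by that index. Everything named here (LOG_TYPES, best_index, order_elements, the overlap score) is an assumption for illustration and not a PyMuPDF API. Only get_drawings(), get_image_info(), get_text("dict"), and get_bboxlog() come from the library. The matching is a heuristic: a text block may span several bboxlog entries, and degenerate rectangles such as thin lines may not match at all, in which case they are sorted last.

```python
# Assumed bboxlog type strings per element kind; verify against real output.
LOG_TYPES = {
    "path": ("fill-path", "stroke-path"),
    "image": ("fill-image",),
    "text": ("fill-text", "stroke-text"),
}


def _area(r):
    """Area of (x0, y0, x1, y1); zero if the rectangle is empty or inverted."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def _overlap(a, b):
    """Intersection-over-union of two rectangles, in the range 0..1."""
    inter = _area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0


def best_index(bboxlog, bbox, kind):
    """Index of the same-kind bboxlog entry that overlaps `bbox` most, else None."""
    best, best_score = None, 0.0
    for i, entry in enumerate(bboxlog):
        if entry[0] in LOG_TYPES[kind]:
            score = _overlap(tuple(entry[1]), tuple(bbox))
            if score > best_score:
                best, best_score = i, score
    return best


def order_elements(elements, bboxlog):
    """Sort (kind, bbox, payload) triples by bboxlog index; unmatched ones go last."""
    keyed = []
    for kind, bbox, payload in elements:
        idx = best_index(bboxlog, bbox, kind)
        keyed.append((len(bboxlog) if idx is None else idx, kind, payload))
    return sorted(keyed, key=lambda item: item[0])


def get_elements(page):
    """Collect paths, images, and text blocks of a PyMuPDF page in drawing order."""
    elements = [("path", tuple(d["rect"]), d) for d in page.get_drawings()]
    elements += [("image", tuple(im["bbox"]), im)
                 for im in page.get_image_info(hashes=True, xrefs=True)]
    elements += [("text", tuple(b["bbox"]), b)
                 for b in page.get_text("dict", flags=11)["blocks"]]
    return order_elements(elements, page.get_bboxlog())


# Stand-in data so the ordering logic can be tried without a PDF.
log = [
    ("fill-path", (100, 100, 200, 200)),
    ("fill-image", (50, 50, 400, 400)),
    ("fill-text", (60, 420, 300, 440)),
]
elements = [
    ("text", (60, 420, 300, 440), "hello"),
    ("image", (50, 50, 400, 400), "img"),
    ("path", (100, 100, 200, 200), "square"),
]
ordered = order_elements(elements, log)
print([kind for _, kind, _ in ordered])  # ['path', 'image', 'text']
```

With a real document, get_elements(page) would be called on a page object obtained from an opened PDF. The first item of each returned triple is the bboxlog index, so a larger value means the element is drawn later.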

1 reply

more-strive Jan 8, 2024
Author

Thank you, I'll give it a try
