How to get the text linked to layers? #2301

jase64 · 2023-03-24T11:20:34Z

Hi Community,

First of all, thanks a lot to the developers (and other helpers) for bringing us this great library for struggling against the PDF format (mess?).

I currently have a set of PDFs that were generated by AutoCAD (version 14). The PDFs were built taking into account the layers defined in AutoCAD. I'm trying to read the text "contained" in (or controlled by) those layers.
I first tried a basic text extraction:

import fitz
doc = fitz.open(filename)
page = doc[0]
text = page.get_text("text")

Unfortunately, 'text' does not contain the expected data.
So I tried to retrieve the OCGs:

ocgs = doc.get_ocgs()

'ocgs' returns a dictionary with layer ids and names

{...
32: {'name': 'PROCESS', 'intent': [], 'on': True, 'usage': None},
 12: {'name': 'EQUIPMENT', 'intent': [], 'on': True, 'usage': None},
 38: {'name': 'VENDOR', 'intent': [], 'on': True, 'usage': None},
...}
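
Since the keys of this dictionary are the xrefs of the OCG objects, it can be handy to look layers up by name instead. A minimal sketch, using only the three entries shown above as sample data; the helper name layer_xrefs_by_name is made up for illustration, and it assumes layer names are unique (a duplicate name would overwrite an earlier one):

```python
def layer_xrefs_by_name(ocgs):
    """Invert the {xref: info} dictionary into {layer name: xref}."""
    return {info["name"]: xref for xref, info in ocgs.items()}


# Sample data copied from the output above. With a real file this would be
# ocgs = doc.get_ocgs()
ocgs = {
    32: {"name": "PROCESS", "intent": [], "on": True, "usage": None},
    12: {"name": "EQUIPMENT", "intent": [], "on": True, "usage": None},
    38: {"name": "VENDOR", "intent": [], "on": True, "usage": None},
}

by_name = layer_xrefs_by_name(ocgs)
print(by_name["EQUIPMENT"])  # prints 12
```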

Then I tried to dig into the xrefs

len_xref = doc.xref_length()
for xref in range(1, len_xref):
    print('')
    print(f"Object {xref}, stream: {doc.xref_is_stream(xref)}")
    print(doc.xref_object(xref, compressed=False))

Again, this returns a list of objects, but no link to their actual content.

...
Object 12, stream: False
<<
  /Name (EQUIPMENT)
  /Type /OCG
>>
...

Does anyone have an idea how to access the text contained in the layers (in the OC)? If so, how to link the layer to the

PS: for the time being I don't want to use OCR extraction. Only if no other choice.

Answered by JorjMcKie

JorjMcKie (Maintainer) · 2023-03-24T12:38:49Z

Thanks for the nice feedback!

Before extracting text, you can temporarily switch on the desired layer using doc.set_layer_ui_config(number, action=0). This is what you would do in a PDF viewer that supports layers. The available configurations can be displayed with doc.layer_ui_configs().
The action parameter is 0 = set on (default), 1 = toggle on/off, 2 = set off.
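
A minimal sketch of this approach, switching off every layer except the one of interest before extracting. The helper name text_for_layer is made up for illustration. It assumes doc is an open PyMuPDF document, that each entry returned by doc.layer_ui_configs() is a dictionary with "number", "text" and "type" keys, and that entries of type "label" are headings rather than switches. Radio-button groups and locked entries may need extra care.

```python
def text_for_layer(doc, layer_name, page_number=0):
    """Return the text of one page with only `layer_name` switched on.

    `doc` is assumed to be an open PyMuPDF document, e.g. fitz.open(filename).
    """
    items = doc.layer_ui_configs()
    if not any(item["text"] == layer_name for item in items):
        raise ValueError(f"no layer named {layer_name!r}")

    for item in items:
        # Assumption: "label" entries are headings and cannot be switched.
        if item.get("type") == "label":
            continue
        # action: 0 = set on, 2 = set off (values as described above)
        action = 0 if item["text"] == layer_name else 2
        doc.set_layer_ui_config(item["number"], action=action)

    # Load the page only after the layer visibility has been changed.
    return doc[page_number].get_text("text")
```

With doc = fitz.open(filename), a call such as text_for_layer(doc, "EQUIPMENT") should then return only the text that is visible when the EQUIPMENT layer is the sole layer switched on. The visibility changes stay in effect on the in-memory document, so switch the other layers back on (action=0) or reopen the file before further processing.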

jase64 (Author) · 2023-03-25T07:50:24Z

Hi Jorj,
Thanks a lot for the hint. It worked fine. I didn't realize it could be as simple as turning off all layers except the one of interest and then running a simple page.get_text('text').
Great work!
