Text Alignment Parsing #2256

JFulweber · 2023-02-21T20:21:36Z


I'm interested in redacting and modifying certain words in PDF documents, which I am able to do successfully. First I do some location detection, then redact at that location, then add a shape with a textbox inserted at that same location. This works. However, I am having trouble maintaining alignment: there does not seem to be anywhere within the PyMuPDF objects where I can access this data. I have looked at extracting with rawdict, expecting perhaps a block-level or even span-level attribute, but this does not seem to be available. The Adobe PDF reference specifies a TextAlign attribute for block-level structure elements. I may be misinterpreting this, as I am not very familiar with the PDF standard, but I take it to mean that the alignment info should be available somewhere in the raw PDF data, and I want a way to expose that to my program.

Ideally I would like that to be added to the output of rawdict extraction, or exposed through some other means of retrieval. If I need to get a specific textbox reference or something similar, that is fine too. If there is an existing way to do this I would very much appreciate a pointer in the right direction; where I have looked so far, I haven't found anything. Thanks!

Answered by JorjMcKie

Feb 25, 2023


JorjMcKie · 2023-02-21T22:17:29Z

Maintainer

The replacement text in the redact method itself is indeed only roughly aligned. However, if you know that the new text will fit exactly, you can use the "origin" point from its span to insert it, and likewise the span's font size. This means you need a separate text insertion step after applying the redactions.
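A minimal sketch of that two-step approach, assuming `page` is a PyMuPDF `Page` and each span is a span dictionary taken from `page.get_text("dict")`. The helper names `iter_spans` and `replace_spans` and the `fontname` default are made up for illustration and are not part of PyMuPDF. All redactions are applied in one pass before any text is inserted, so the new text is not removed again by a later redaction.

```python
def iter_spans(page_dict):
    """Yield every span dict from the result of page.get_text("dict")."""
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span


def replace_spans(page, replacements, fontname="helv"):
    """Redact spans, then write new text at each span's original origin.

    replacements: list of (span_dict, new_text) pairs (hypothetical shape).
    Note: redaction removes all text intersecting the span's bbox, and the
    inserted font is an assumption that may differ from the original font.
    """
    # Step 1: mark every old span for removal, then apply once.
    for span, _ in replacements:
        page.add_redact_annot(span["bbox"])
    page.apply_redactions()

    # Step 2: separate insertion step, reusing baseline origin and font size.
    for span, new_text in replacements:
        page.insert_text(
            span["origin"],
            new_text,
            fontsize=span["size"],
            fontname=fontname,
        )
```

With PyMuPDF installed, the spans could come from something like `list(iter_spans(page.get_text("dict")))`, filtered by `span["text"]`. This only looks right when the new text has about the same width as the old text, as noted above.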


JFulweber · 2023-02-22T15:28:52Z

Author

I am using a separate text insertion step, with the text rect being the same size as the original area, but my new text may not have the same pixel width/height. When it is not exactly the same (i.e. the new text does not entirely fill the area that the old text occupied), I would like to match the alignment of the previous text. If the original text is centered, I should pass align=1 when inserting the textbox, and so on. However, I am not seeing any way to obtain that data, though the specification seems to imply that it may exist.

JorjMcKie · 2023-02-25T16:45:00Z

Maintainer

> I am using a separate text insertion step, with the text rect being the same size as the original area, but my new text may not have the same pixel width/height. When it is not exactly the same (i.e. the new text does not entirely fill the area that the old text occupied), I would like to match the alignment of the previous text. If the original text is centered, I should pass align=1 when inserting the textbox, and so on. However, I am not seeing any way to obtain that data, though the specification seems to imply that it may exist.

It is not possible to find out a text's original alignment, sorry. Even the PDF spec does not have this concept, at least not in this way.
Yes, there are things like inter-word (Tw) and inter-character (Tc) spacing, but (1) these are different things, and (2) this information is not available via text extraction.
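Because the alignment is not exposed, one possible workaround is to guess it from geometry. The sketch below is a heuristic, not a PyMuPDF feature: it assumes a multi-line text block from `page.get_text("dict")` and compares the left edges, right edges, and centers of its line bounding boxes. The function names and the tolerance `tol` (in points) are assumptions for illustration. The return values follow the integer codes PyMuPDF uses for the `align` parameter (0 left, 1 center, 2 right, 3 justify).

```python
def block_line_bboxes(block):
    """Collect the (x0, y0, x1, y1) bbox of every line in a text block dict."""
    return [tuple(line["bbox"]) for line in block.get("lines", [])]


def guess_alignment(line_bboxes, tol=1.5):
    """Guess horizontal alignment from line bboxes.

    Returns 0 (left), 1 (center), 2 (right), 3 (justify),
    or None when there is too little evidence to decide.
    """
    if len(line_bboxes) < 2:
        return None  # a single line says nothing about alignment

    lefts = [b[0] for b in line_bboxes]
    rights = [b[2] for b in line_bboxes]
    centers = [(b[0] + b[2]) / 2 for b in line_bboxes]

    def spread(values):
        return max(values) - min(values)

    left_ok = spread(lefts) <= tol
    # The last line of a justified paragraph is usually short, so ignore it
    # when there are enough lines to do so.
    body_rights = rights[:-1] if len(rights) > 2 else rights

    if left_ok and spread(body_rights) <= tol:
        return 3
    if left_ok:
        return 0
    if spread(rights) <= tol:
        return 2
    if spread(centers) <= tol:
        return 1
    return None
```

The guess can then be passed as `align` when inserting the replacement textbox. It cannot work for a single isolated line, and lines of equal width are ambiguous, so a fallback (for example left alignment) is still needed.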
