Uninterpretable content in content stream when a PDF was created with text using PyMuPDF #1394

meghanaviyyapu · 2021-11-12T06:46:48Z

I have created a PDF file and added some text using the PyMuPDF library:

import fitz
doc = fitz.open()                    # new, empty PDF
page = doc.new_page()                # default page size
where = fitz.Point(200, 10)          # insertion point, measured from the top-left corner
text = "ABC"
page.insert_text(where, text)
path = r"F:\PyMuPDF_test.pdf"        # raw string keeps the backslash literal
doc.save(path)

After opening the created PDF file and checking the contents using doc.xref_stream(xref).decode(), I found [<414243>]. Can you let me know what these values are?

Answered by JorjMcKie

JorjMcKie · 2021-11-12T08:54:12Z

Maintainer

A typical "Discussions" item - no issue.

Content streams are written in PDF's mini-language, which defines the appearance of pages, annotations and some other objects.
The syntax is explained from page 643 onward in https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf.
That is dozens of pages, too much to explain here. Think of it as the source code of a programming language that you do not know.
On top of that, content streams are usually compressed, so you won't see that source code in ASCII.
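
To make the compression point concrete, here is a minimal sketch in plain Python. It assumes the stream uses the FlateDecode filter (zlib/deflate), and the sample bytes are simply the operators discussed in this thread, not a dump of a real file.

```python
import zlib

# Content-stream source as it might look for the "ABC" example (assumed sample).
source = b"q\nBT\n1 0 0 1 200 832 Tm\n/helv 11 Tf\n[<414243>] TJ\nET\nQ\n"

# A PDF writer using the FlateDecode filter stores zlib-compressed bytes.
stored = zlib.compress(source)

# The stored form is a binary zlib stream; its first bytes are a zlib header.
print(stored[:2])

# Decompressing restores the readable operator sequence.
restored = zlib.decompress(stored)
print(restored.decode("ascii"))
```

This is also why a raw look at the file shows binary data, while a decompressing reader shows the operators.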

To give you some start at least:

q - push the current graphics state onto the stack
BT - begin a text object
1 0 0 1 200 832 Tm - set the text matrix. The six numbers before Tm are the matrix parameters. This one positions the text 200 points from the left edge and 832 points above the bottom of the page.
/helv 11 Tf - select the font named "helv" with a font size of 11
[<414243>] TJ - show 3 characters given in hex format, here "ABC"
ET - end the text object
Q - restore the previous graphics state from the stack
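
Two of the values in the operator list above can be checked with a few lines of plain Python. This is only a sketch: it assumes a single-byte font encoding and a page height of 842 points (A4), and the helper names are made up for illustration.

```python
import re

PAGE_HEIGHT = 842  # assumed: height in points of an A4 page


def decode_hex_string(token: str) -> str:
    """Decode a PDF hex string such as '<414243>' for a single-byte font."""
    digits = re.sub(r"\s+", "", token.strip("<>"))
    if len(digits) % 2:  # an odd digit count is padded with a final 0
        digits += "0"
    return bytes.fromhex(digits).decode("latin-1")


def pdf_y_from_top(y_from_top, page_height=PAGE_HEIGHT):
    """Convert a distance from the page top into PDF's bottom-up y coordinate."""
    return page_height - y_from_top


print(decode_hex_string("<414243>"))  # ABC
print(pdf_y_from_top(10))             # 832
```

The second result explains the 832 in the Tm line: the point was given 10 points below the top edge, while PDF measures y upward from the bottom of the page.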

1 reply

meghanaviyyapu Nov 12, 2021
Author

Thanks for the explanation. Can you let me know why, in some cases, the content stream contains special characters instead of characters in hex format?

sample-pdf-with-images.pdf

JorjMcKie · 2021-11-12T09:32:04Z

Maintainer

Can you let me know why, in some cases, the content stream contains special characters instead of characters in hex format?

Sometimes you are free to choose between hex and non-hex notation; it depends on the font in use. For some fonts you cannot put the output characters ("ABC") in the contents.
Instead, you must use the glyph number of each character, given as an integer in hex format. The translation between the character's Unicode and its glyph number is then handled by the font's font program.
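
The two notations can be compared in plain Python. This is a rough sketch under assumptions: the first comparison presumes a simple single-byte font, the second presumes a font addressed with 2-byte codes, and the glyph numbers shown are invented for illustration.

```python
from typing import List


def hex_string_bytes(token: str) -> bytes:
    """Raw bytes of a PDF hex string like '<414243>'."""
    return bytes.fromhex(token.strip("<>"))


def literal_string_bytes(token: str) -> bytes:
    """Raw bytes of a simple PDF literal string like '(ABC)'.

    Escape sequences and nested parentheses are deliberately not handled.
    """
    return token[1:-1].encode("latin-1")


def two_byte_codes(token: str) -> List[int]:
    """Split a hex string into 2-byte codes, as used with some composite fonts."""
    raw = hex_string_bytes(token)
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


# With a simple font, both notations carry the same bytes.
print(hex_string_bytes("<414243>") == literal_string_bytes("(ABC)"))  # True

# Hypothetical glyph numbers: the codes alone say nothing about Unicode.
print(two_byte_codes("<002400250026>"))  # [36, 37, 38]
```

When codes like these are written as a literal string rather than in hex, the raw bytes can look like arbitrary special characters in a text dump.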

1 reply

meghanaviyyapu Nov 12, 2021
Author

Does PyMuPDF have the capability of giving the xref of text in a PDF?

JorjMcKie · 2021-11-12T09:38:50Z

Maintainer

Does PyMuPDF have the capability of giving the xref of text in a PDF?

Text in a PDF does not have an xref at all, so nothing can provide one.
Only the /Contents object has an xref, and there may be several of those for a page.
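
As an illustration, the /Contents xrefs of a page can be listed and their streams read roughly like this. The sketch assumes PyMuPDF's `Page.get_contents()` and `Document.xref_stream()`; `content_streams` is a hypothetical helper name, and the file name in the usage comment is a placeholder.

```python
def content_streams(doc, page):
    """Return {xref: decoded source} for every /Contents object of a page.

    `doc` and `page` are expected to be a PyMuPDF Document and Page.
    """
    result = {}
    for xref in page.get_contents():   # list of /Contents xref numbers
        raw = doc.xref_stream(xref)    # decompressed stream bytes
        result[xref] = raw.decode("latin-1")
    return result


# Usage with PyMuPDF installed (the file name is a placeholder):
#   import fitz
#   doc = fitz.open("PyMuPDF_test.pdf")
#   for xref, source in content_streams(doc, doc[0]).items():
#       print(xref, source)
```

The xrefs returned here identify whole streams, not individual pieces of text inside them.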

0 replies

JorjMcKie · 2021-11-12T09:41:38Z

Maintainer

What are you trying to achieve?
If you think you can reconstruct text by interpreting the contents source yourself: give up!
That is impossible, unless you rewrite a PDF viewer like Adobe Acrobat or MuPDF yourself.
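
For merely locating text on a page, a more practical route is to let MuPDF do the interpreting. A minimal sketch, assuming PyMuPDF's `Page.get_text("words")`, whose items start with x0, y0, x1, y1 and the word; `find_word` is a hypothetical helper name.

```python
def find_word(page, target):
    """Return bounding boxes (x0, y0, x1, y1) of words equal to `target`.

    `page` is expected to be a PyMuPDF Page.
    """
    boxes = []
    for item in page.get_text("words"):
        x0, y0, x1, y1, word = item[:5]
        if word == target:
            boxes.append((x0, y0, x1, y1))
    return boxes


# Usage with PyMuPDF installed (the file name is a placeholder):
#   import fitz
#   page = fitz.open("PyMuPDF_test.pdf")[0]
#   print(find_word(page, "ABC"))
```

Note that this yields page coordinates only. It does not tell you where in the content stream the text was written.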

4 replies

meghanaviyyapu Nov 12, 2021
Author

I am trying to add tags to a PDF.

To add the /K tag I need the text in the content stream, so I was trying to find out whether there is a specific way to get a particular piece of text from a PDF.

JorjMcKie Nov 12, 2021
Maintainer

I see. That will not work I am afraid - as I explained.

meghanaviyyapu Nov 12, 2021
Author

Yeah, I understood. Do you know of any Python libraries that can be used for adding tags like this?

JorjMcKie Nov 12, 2021
Maintainer

I am not aware of any, but maybe you will find something here.
