The Japanese PDF is garbled. #345

masaoy0730 · 2025-03-01T13:00:42Z

I am trying to extract text from Japanese PDFs, but the extracted text is garbled.
I am using pypdfium2 4.30.1 with pdfium 133.0.6899.0.
See the attached image for reference.

When I process the same file with Azure AI Search, the text is not garbled, so I do not think the PDFs themselves are at fault.
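
For context, here is a minimal sketch of what a per-page extraction with pypdfium2 might look like. The original post does not show its code, so the helper name `extract_text` and the path passed to it are assumptions made for illustration only; the calls used (`PdfDocument`, `get_textpage`, `get_text_range`) are part of the pypdfium2 helper API.

```python
def extract_text(pdf_path):
    """Return the extracted text of each page as a list of strings.

    Hypothetical helper: the name and structure are illustrative and
    are not taken from the original post.
    """
    # Imported lazily so the helper can be defined even where
    # pypdfium2 is not installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        pages_text = []
        for index in range(len(pdf)):
            page = pdf[index]
            try:
                textpage = page.get_textpage()
                try:
                    # With no arguments, this returns all text on the page.
                    pages_text.append(textpage.get_text_range())
                finally:
                    textpage.close()
            finally:
                page.close()
        return pages_text
    finally:
        pdf.close()
```

Calling `extract_text("case_neuron_toinx.pdf")` and printing the result would be one way to reproduce the output as text rather than as a screenshot.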

mara004 · 2025-03-01T16:29:22Z

Maintainer

Please provide the PDF in question, and a clear comparison of what's wrong in the output and what you would expect instead.
Also, don't embed screenshots of text – paste or attach the text itself instead.

That said, the pdfium update in v4.30.1 introduced a text extraction bug (see #336). I don't know whether this is the same issue or unrelated, but could you check whether extraction works correctly with v4.30.0 or v5.0.0b1?
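
When comparing releases like this, it helps to record which version is actually installed in the environment under test. A minimal sketch using only the standard library follows; the helper name `installed_version` is an assumption for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints None for the version if pypdfium2 is not installed.
    print("pypdfium2:", installed_version("pypdfium2"))
```

A specific release can usually be selected by pinning it, for example `pip install "pypdfium2==4.30.0"`, and then re-running the extraction.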

3 replies

masaoy0730 Mar 1, 2025
Author

Thank you for your comment.
Here are examples of the garbled text:
"Enterprise 㻿earch Engine", expected "Enterprise Search Engine", and
"「I㼀䛷、感動を、䛸䜒䛻。」をスローガン䛻、1954年䛾創業以降、東北䛾I㼀業界䛾リーダー䛸し䛶、東北電力グループを䛿じ䜑䛸する地域䛾お客さ䜎䛾情報システム䛾開発・保守・運用を支え、多く䛾お客さ䜎から䛾信頼や技術力を誇る株式会社トインクス。", expected "「ITで、感動を、ともに。」をスローガンに、1954年創業以降、東北IT業界リーダーとして、東北電力グループをはじめとする地域お客さま情報システム開発・保守・運用を支え、多くのお客さまから信頼や技術力を誇る株式会社トインクス。"
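
A comparison of this kind can also be produced mechanically, which makes it easier to see exactly which characters were substituted. The sketch below uses the standard library's `difflib` to list the spans that differ between the extracted and the expected string, along with their code points. The function names `diff_chars` and `describe` are hypothetical helpers, not part of pypdfium2.

```python
import difflib


def diff_chars(actual, expected):
    """List (tag, actual_span, expected_span) for every non-matching span."""
    matcher = difflib.SequenceMatcher(None, actual, expected, autojunk=False)
    diffs = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, actual[a0:a1], expected[b0:b1]))
    return diffs


def describe(text):
    """Format each character of text as a U+XXXX code point."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    got = "Enterprise 㻿earch Engine"
    want = "Enterprise Search Engine"
    for tag, a_span, b_span in diff_chars(got, want):
        print(tag, repr(a_span), describe(a_span), "->", repr(b_span), describe(b_span))
```

Listing the code points alongside the characters shows whether the substitutions follow a pattern, which may be useful detail for a bug report.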

As I wrote, I use pypdfium2 4.30.1, pdfium 133.0.6899.0.
The attached file is one of the PDFs that produces garbled text.
case_neuron_toinx.pdf

mara004 Mar 1, 2025
Maintainer

You're right, thanks for sharing the PDF.
I confirmed this is not the same as #336 but a separate issue, and it also happens with 4.30.0 and 5.0.0b1/latest.
As pypdfium2 is only forwarding the text provided by pdfium, could you please re-file an issue upstream?
https://issues.chromium.org/issues?q=componentid:1586257%2B%20is:open
Thanks.

masaoy0730 Mar 2, 2025
Author

Thank you for your reply.
I have filed an upstream issue:
https://issues.chromium.org/399937354 The Japanese PDF is garbled
