[Bug]: Text extraction / text layer rendering for Hebrew content #20336

@gwtdevlpr

Attach (recommended) or Link to PDF file

testPdf.pdf

Web browser and its version

Chrome 139, Firefox 131

Operating system and its version

Windows 11, macOS 15, Ubuntu 22.04

PDF.js version

v5.3.31.a1

Is the bug present in the latest PDF.js version?

Yes

Is a browser extension

No

Steps to reproduce the problem

  1. Open the attached PDF (contains Hebrew content).

  2. Inspect the text layer using the browser developer tools (.textLayer spans).

  3. On page 1, the phrase visually rendered as אישור אגודה לחתימת appears in reversed order in the text layer, which becomes visible when searching (e.g., לחתימת אישור אגודה).

  4. Try searching in the PDF viewer for חוזה חכירה

    On page 1 → "No results found".

  5. Searching for חכירה חוזה yields a result on page 2, whereas the text rendered on page 2 is חוזה חכירה.
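
The search behavior in steps 4 and 5 can be checked mechanically. Below is a minimal sketch, assuming the text of a page has already been extracted into a plain string. `classifyMatch` and `MatchKind` are hypothetical names used for illustration only and are not part of PDF.js, and the sample page text is made up rather than taken from the attached PDF.

```typescript
// Hypothetical diagnostic helper (not a PDF.js API): reports whether a query
// is found as typed, only with its words in reverse order, or not at all.
type MatchKind = "as-typed" | "word-reversed" | "none";

// Collapse runs of whitespace so line breaks in the extracted text
// do not hide a match.
function normalizeSpaces(s: string): string {
  return s.trim().split(/\s+/).join(" ");
}

function classifyMatch(pageText: string, query: string): MatchKind {
  const text = normalizeSpaces(pageText);
  const q = normalizeSpaces(query);
  if (q.length === 0) return "none";
  if (text.includes(q)) return "as-typed";
  // Reverse the order of the words, not the characters inside each word.
  const reversed = q.split(" ").reverse().join(" ");
  if (reversed !== q && text.includes(reversed)) return "word-reversed";
  return "none";
}

// The two words of the search phrase from step 4, kept separate so the
// logical order is unambiguous regardless of how an editor displays them.
const firstWord = "חוזה";
const secondWord = "חכירה";
const searchPhrase = `${firstWord} ${secondWord}`;

// Assumed sample text standing in for what step 5 suggests page 2 contains.
const samplePageText = `${secondWord} ${firstWord}`;
console.log(classifyMatch(samplePageText, searchPhrase));
```

Running a check like this over the extracted text of every page would show which pages match only in reversed word order, which is the inconsistency described in steps 3 to 5.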

What is the expected behavior?

The text layer should consistently preserve the correct order of Hebrew text across all pages.

Search should work reliably on all pages for Hebrew text.

What went wrong?

The text layer for Hebrew content is inconsistent with the visual rendering. While the text displays correctly on the canvas, the extracted text in the text layer is sometimes reversed or altered. This causes search, copy-paste, and text extraction to fail on certain pages.
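
One thing that may help narrow this down is the direction reported for each extracted text item. The sketch below assumes items shaped like those returned by PDF.js `getTextContent()`, of which only the `str` and `dir` fields are used. `TextItemLike` and `hebrewItemsNotMarkedRtl` are hypothetical names for illustration. This report does not establish that direction metadata is the cause, so treat it as one possible check rather than a diagnosis.

```typescript
// Minimal shape mirroring two fields of a PDF.js text item;
// all other fields are omitted in this sketch.
interface TextItemLike {
  str: string;
  dir: string; // expected values: "ltr", "rtl" or "ttb"
}

// Matches any character in the Unicode Hebrew block (U+0590 to U+05FF).
const HEBREW_CHAR = /[\u0590-\u05FF]/;

// Hypothetical helper: returns the items that contain Hebrew characters
// but are not marked as right-to-left.
function hebrewItemsNotMarkedRtl(items: TextItemLike[]): TextItemLike[] {
  return items.filter(
    (item) => HEBREW_CHAR.test(item.str) && item.dir !== "rtl"
  );
}
```

Comparing the result for a page where search works against a page where it fails would show whether the two pages differ in how their Hebrew items are labeled.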

Link to a viewer

No response

Additional context

No response
