Use ActualText when getting the text for the text layer #20014
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
|  | @@ -2406,6 +2406,7 @@ class PartialEvaluator { | |
| transform: null, | ||
| fontName: null, | ||
| hasEOL: false, | ||
| span: "", | ||
| }; | ||
|  | ||
| // Use a circular buffer (length === 2) to save the last chars in the | ||
|  | @@ -3070,6 +3071,19 @@ class PartialEvaluator { | |
| textContentItem.str.length = 0; | ||
| } | ||
|  | ||
| function replaceTextContentBySpan() { | ||
| const { span, str } = textContentItem; | ||
| if (!span) { | ||
| return; | ||
| } | ||
| textContentItem.span = ""; | ||
| if (/^\s+$/.test(span)) { | ||
| return; | ||
| } | ||
| str.length = 0; | ||
| str.push(span); | ||
| } | ||
|  | ||
| function enqueueChunk(batch = false) { | ||
| const length = textContent.items.length; | ||
| if (length === 0) { | ||
|  | @@ -3446,6 +3460,11 @@ class PartialEvaluator { | |
| return; | ||
| case OPS.beginMarkedContent: | ||
| flushTextContentItem(); | ||
| if (args[0]?.name === "Span") { | ||
| textContentItem.span = stringToPDFString( | ||
| args[1]?.get("ActualText") || "" | ||
| Review comment: Does this PR also fix #12237 perhaps? Reply: This would be a start at fixing that issue. This is the first step, getting this  | ||
| ); | ||
| } | ||
| Comment on lines +3463 to +3467: I'm not sure what this addition does. This is for a  | ||
| if (includeMarkedContent) { | ||
| markedContentData.level++; | ||
|  | ||
|  | @@ -3457,6 +3476,11 @@ class PartialEvaluator { | |
| break; | ||
| case OPS.beginMarkedContentProps: | ||
| flushTextContentItem(); | ||
| if (args[0]?.name === "Span") { | ||
| textContentItem.span = stringToPDFString( | ||
| args[1]?.get("ActualText") || "" | ||
| ); | ||
| } | ||
| if (includeMarkedContent) { | ||
| markedContentData.level++; | ||
|  | ||
|  | @@ -3474,6 +3498,7 @@ class PartialEvaluator { | |
| } | ||
| break; | ||
| case OPS.endMarkedContent: | ||
| replaceTextContentBySpan(); | ||
| flushTextContentItem(); | ||
| if (includeMarkedContent) { | ||
| if (markedContentData.level === 0) { | ||
|  | ||
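
The behavior of `replaceTextContentBySpan` in the diff above can be tried in isolation. This is a minimal sketch, not the code in `PartialEvaluator`: it assumes a simplified stand-in for `textContentItem` that holds only `str` and `span`, and it takes that object as a parameter instead of closing over it.

```javascript
// Standalone sketch of the replacement step from the diff.
// `textContentItem` here is a simplified stand-in (an assumption), not the
// full object that PartialEvaluator builds.
function replaceTextContentBySpan(textContentItem) {
  const { span, str } = textContentItem;
  if (!span) {
    // No /ActualText is pending for the current Span.
    return;
  }
  // Consume the pending value so it is applied at most once.
  textContentItem.span = "";
  if (/^\s+$/.test(span)) {
    // Whitespace-only /ActualText: keep the text collected from the glyphs.
    return;
  }
  // Replace the collected glyph text with the /ActualText string.
  str.length = 0;
  str.push(span);
}

const item = { str: ["T", "h", "e"], span: "The quick" };
replaceTextContentBySpan(item);
console.log(item.str.join("")); // "The quick"
```

Because `span` is cleared before the whitespace check, a whitespace-only /ActualText is dropped and cannot affect a later item.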
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
|  | @@ -726,3 +726,4 @@ | |
| !chrome-text-selection-markedContent.pdf | ||
| !bug1963407.pdf | ||
| !issue19517.pdf | ||
| !issue20007.pdf | ||
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
|  | @@ -3923,6 +3923,20 @@ Caron Broadcasting, Inc., an Ohio corporation (“Lessee”).`) | |
| expect(items[1].fontName).not.toEqual(items[0].fontName); | ||
| }); | ||
|  | ||
| it("gets text content from /ActualText", async function () { | ||
| const loadingTask = getDocument(buildGetDocumentParams("issue20007.pdf")); | ||
| Review comment: I don't really know why, but the unit test failure suggests that this can't be loaded: Moreover, is the movement in the reference test expected? | ||
| const pdfDoc = await loadingTask.promise; | ||
| const pdfPage = await pdfDoc.getPage(1); | ||
|  | ||
| const { items } = await pdfPage.getTextContent({ | ||
| disableNormalization: true, | ||
| }); | ||
| const text = mergeText(items); | ||
| expect(text).toEqual("The quick brown fox jumps over the lazy dog"); | ||
|  | ||
| await loadingTask.destroy(); | ||
| }); | ||
|  | ||
| it("gets empty structure tree", async function () { | ||
| const tree = await page.getStructTree(); | ||
|  | ||
|  | ||
Can args[1]?.get("ActualText") be exposed in the getOperatorList result as well, e.g. something like this in pdf.js/src/core/evaluator.js (lines 2300 to 2303 in d2a6638)?
I'm not sure whether it's a breaking change, but it's crucial for reconstructing content (e.g. SVG) from the results of getOperatorList() when not using getTextContent().
Could you file a bug and explain why it'd be useful to have such a feature?
Could it help to fix an existing issue in the current viewer?
@calixteman OK, I'll open another ticket for it. I don't think it's related to the current issue with the viewer.
Actually, I opened the original ticket because I got wrong text from getOperatorList(). The viewer is also affected, so I used it in the ticket since that is easier to reproduce than a code snippet.
I was building a PDF -> SVG conversion tool with getOperatorList(). I found getTextContent() not useful for this: it only extracts text, the shape info can only be obtained from getOperatorList(), and there's no easy way to interleave the text and shapes back into the correct order from the results of both functions. So I ditched getTextContent() and use only getOperatorList(), including for text.
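
To make the use case above concrete, here is a rough sketch of a consumer that walks a single operator stream and keeps text and shapes in order. Everything about the data shape is an assumption for illustration: the `ops` array is a mock, the `actualText` property is hypothetical (the thread is asking for something like it, it is not part of the current getOperatorList() result), and `collectInOrder` is an invented helper. Only the operator names echo the ones in the diff, and nested marked content is not handled.

```javascript
// Mock operator stream. The object shape and `actualText` are hypothetical.
const ops = [
  { fn: "constructPath", args: ["rect"] },
  { fn: "beginMarkedContentProps", args: ["Span", { actualText: "fox" }] },
  { fn: "showText", args: ["f"] },
  { fn: "showText", args: ["ox"] },
  { fn: "endMarkedContent", args: [] },
  { fn: "showText", args: ["!"] },
];

// Walk the stream once, emitting shapes and text in document order.
// Text inside a Span is buffered and replaced by its ActualText when the
// Span closes, unless the ActualText is empty or whitespace-only.
function collectInOrder(operators) {
  const out = [];
  let pending = null; // ActualText of the open Span, or null outside a Span
  let buffered = [];
  for (const { fn, args } of operators) {
    switch (fn) {
      case "beginMarkedContentProps":
        pending = args[0] === "Span" ? (args[1]?.actualText ?? "") : null;
        buffered = [];
        break;
      case "showText":
        if (pending !== null) {
          buffered.push(args[0]);
        } else {
          out.push({ type: "text", value: args[0] });
        }
        break;
      case "endMarkedContent":
        if (pending !== null) {
          const useActual = pending !== "" && !/^\s+$/.test(pending);
          out.push({
            type: "text",
            value: useActual ? pending : buffered.join(""),
          });
        }
        pending = null;
        buffered = [];
        break;
      default:
        out.push({ type: "shape", value: fn });
    }
  }
  return out;
}

console.log(collectInOrder(ops));
```

With the mock input, the rectangle stays ahead of the text, and the two glyph runs inside the Span collapse into one text entry carrying the replacement string.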