@fukudasjp
Contributor

@fukudasjp fukudasjp commented Nov 10, 2021

What do these changes do/fix?

This depends on #237.

This PR adds the PDFViewerHighlighter component, a highlight layer for PDFs that is driven by text spans on fields of a search result document. It also implements PDFViewerWithHighlight, which integrates PDFViewerHighlighter with PDFViewer.

The implementation includes logic to calculate the bounding boxes (bboxes) used for highlighting. An overview of the logic is in README.md.

  • Add generic utilities used by the logic
    • feat: add types and common utilities 45e5da8
  • Add an option to extract HTML source in bbox from HTML field to processDoc
    • feat: add option for bbox text to processDoc c2ab1f6
  • Implement the logic to calculate bboxes for highlighting (please refer to README.md for a high-level view of the logic)
    • feat: add text layer classes 20e4866
      • Implement TextLayouts
    • feat: add highlighting logic and README 839f948
      • Implement textBoxMapping and Highlighter
  • Implement Rect components and storybook
    • feat: add PDF highlight component 67c0ac9

How do you test/verify these changes?

  • Open the storybook story DocumentPreview > components > PdfViewerWithHighlight > with text selection.
  • In the right pane, select any field other than html from the dropdown at the top.
  • Select a range of text.
  • Verify that the corresponding text in the PDF is highlighted.

Have you documented your changes (if necessary)?

Are there any breaking changes included in this pull request?

@jhpedemonte
Member

How do you test/verify these changes?

Can you please add to this section?


Did a quick test in Storybook. I see the new PdfViewerWithHighlight section and I can select text on the right and see it appear on the left when selecting text[0] (I assume this is what it is supposed to show off). However, I don't see the same when selecting text in either header[0] or footer[0] -- should those work, too?

Member

@jhpedemonte jhpedemonte left a comment

One of the major things you were investigating was highlighting on Japanese text, correct? You should include a sample doc to show this off (similar to __fixtures__/Art Effects.pdf).

@fukudasjp
Contributor Author

Did a quick test in Storybook. I see the new PdfViewerWithHighlight section and I can select text on the right and see it appear on the left when selecting text[0] (I assume this is what it is supposed to show off). However, I don't see the same when selecting text in either header[0] or footer[0] -- should those work, too?

Thank you for pointing this out. They should work. I found some issues with them and will fix them.

@fukudasjp
Contributor Author

One of the major things you were investigating was highlighting on Japanese text, correct? You should include a sample doc to show this off (similar to __fixtures__/Art Effects.pdf).

I just added a small Japanese PDF sample.

Base automatically changed from feat/redner-pdf-text to master December 6, 2021 04:41
@fukudasjp fukudasjp changed the base branch from master to fix-yarn-lock-build-error December 6, 2021 13:01
@fukudasjp fukudasjp changed the base branch from fix-yarn-lock-build-error to master December 6, 2021 13:02
@fukudasjp fukudasjp marked this pull request as ready for review December 6, 2021 13:02
@jhpedemonte
Member

  1. The story shows selecting text and having it highlighted on the PDF. Does the reverse work? I'm thinking of a scenario where the user is presented with a PDF and selects some text, and the app needs to figure out which text-based enrichment it belongs to.
  2. Have you tested scanned (non-programmatic) PDFs yet, which have gone through OCR? I'm wondering how well this code works in those cases.
    • Also, as mentioned, we'll need to consider scanned documents that are skewed and how we handle highlighting there. I think the suggestion at the time (from Mauricio?) was that we may have enough info to rotate the original document so it no longer appears skewed, in which case highlighting shouldn't be an issue.

Member

@jhpedemonte jhpedemonte left a comment

Looks good overall and functions well in Storybook. Found a few minor things. With those fixed, approved.


/**
* Flag indicating whether to use bbox information from the html field in the document.
* True by default. This is for testing and debugging purposes.
Member

I think that we should better denote that these properties aren't meant to be used normally by end users. Best I can think is to prepend an underscore to the name:

  _useHtmlBbox?: boolean;
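
A minimal sketch of the suggested naming, for illustration only. The interface name `HighlightProps` and the helper `resolveUseHtmlBbox` are hypothetical and not taken from the PR; only the `_useHtmlBbox` property and its default of true come from the thread.

```typescript
// Hypothetical props shape; only `_useHtmlBbox` comes from the review suggestion.
interface HighlightProps {
  // Internal flag for testing and debugging. The leading underscore signals
  // that end users are not expected to set it.
  _useHtmlBbox?: boolean;
}

// Hypothetical helper: resolves the flag, defaulting to true as the
// doc comment under review states.
function resolveUseHtmlBbox(props: HighlightProps): boolean {
  return props._useHtmlBbox ?? true;
}
```

With this shape, callers that omit the flag get the default behavior, and the underscore makes accidental use easier to spot in review.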

@@ -0,0 +1,25 @@
.withTextSelection {
display: flex;
height: 800px;
Member

Why does this have an explicit height? Is it possible to make it relative?


.highlight {
opacity: 0.4;
background: rgba(255, 64, 128, 1);
Member

We should check what Carbon says about highlighting text and which colors/opacity to use.

Member

Only thing I found so far is $highlight from https://www.carbondesignsystem.com/guidelines/color/usage. That's fine if there's a single highlight in the component, but we'll need more guidance if we have to handle multiple highlight colors.

Contributor Author

I updated the code to use the $highlight color for now.

const heightB = bottomB - topB;

// compare height ratio
const OVERLAP_RATIO = 0.8;
Member

  1. Move this constant out of the function and to the top of the file.
  2. Also, what is the purpose of this constant? How was it calculated? Add a descriptive comment.
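
One way the two requests could look, as a sketch under stated assumptions. Only the name `OVERLAP_RATIO`, its value 0.8, and the `heightB = bottomB - topB` computation appear in the diff. The function `overlapsVertically`, the `Bbox` type, the meaning given to the ratio, and the assumption that `top` is smaller than `bottom` are all hypothetical.

```typescript
// Bbox as [left, top, right, bottom], with top < bottom (an assumption).
type Bbox = [number, number, number, number];

// Minimum fraction of the shorter box's height that two boxes must share
// vertically to be treated as overlapping. The value 0.8 is from the PR;
// the interpretation described here is an assumption for illustration.
const OVERLAP_RATIO = 0.8;

// Hypothetical helper: compares the shared vertical extent to the shorter height.
function overlapsVertically(boxA: Bbox, boxB: Bbox): boolean {
  const [, topA, , bottomA] = boxA;
  const [, topB, , bottomB] = boxB;
  const heightA = bottomA - topA;
  const heightB = bottomB - topB;

  const shared = Math.min(bottomA, bottomB) - Math.max(topA, topB);
  const shorter = Math.min(heightA, heightB);
  if (shorter <= 0) {
    return false; // degenerate box, nothing to compare
  }
  return shared / shorter >= OVERLAP_RATIO;
}
```

Keeping the constant at module scope with its comment means the threshold is documented once and can be tuned without reading the function body.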

*/
export function nonEmpty<T>(value: T | null | undefined): value is T {
  return value !== null && value !== undefined;
}
Member

This function is generic and not specific to PdfViewerHighlight. Move it up to the top-level utils/.
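
To illustrate why the helper is generic, here is a short sketch of how such a type guard is typically used with `Array.prototype.filter`. The `spans` array is a hypothetical input, not data from the PR.

```typescript
// Type guard as shown in the snippet under review (export omitted here).
function nonEmpty<T>(value: T | null | undefined): value is T {
  return value !== null && value !== undefined;
}

// Hypothetical input containing gaps.
const spans: Array<string | null | undefined> = ['a', null, 'b', undefined];

// Because nonEmpty is a type predicate, filter narrows the element type,
// so the result is string[] rather than (string | null | undefined)[].
const present: string[] = spans.filter(nonEmpty);
```

Nothing in the guard depends on PDF highlighting, which supports moving it to a shared utilities location.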

}

/**
* Text box mapping
Member

This doesn't really explain this component. What does this map between?

const MAX_HISTORY = 3;

export type TextMatch = {
/** matched text span */
Member

For single-line comments like these, let's keep to the convention of starting with // and leave /** for multi-line comments (here and elsewhere in the PR).

(Wonder if there's an eslint/prettier plugin for this?)

Comment on lines 4 to 7
export const LEFT = 0;
export const TOP = 1;
export const RIGHT = 2;
export const BOTTOM = 3;
Member

Where are these used?

Contributor Author

They were used in dom.ts. I refactored the code to use destructuring assignment.
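
The refactoring described here might look roughly like the following sketch. The element order [left, top, right, bottom] follows the index constants quoted above; the `Bbox` type and the `bboxSize` function are hypothetical and do not reflect the actual contents of dom.ts.

```typescript
// Order matches the removed constants: LEFT = 0, TOP = 1, RIGHT = 2, BOTTOM = 3.
type Bbox = [number, number, number, number];

// Hypothetical helper: destructuring names each coordinate at the point of
// use, so no index constants are needed.
function bboxSize(bbox: Bbox): { width: number; height: number } {
  const [left, top, right, bottom] = bbox;
  return { width: right - left, height: bottom - top };
}
```

For example, `bboxSize([10, 20, 110, 50])` yields a width of 100 and a height of 30.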

Comment on lines 25 to 26
opacity: 0.5;
background: rgba(0, 0, 255, 1);
Member

Same here: need to check with Design about use of highlight color/opacity.

Contributor Author

OK. I'll use the Carbon color with opacity as the default and check with Design.

@fukudasjp
Contributor Author

The story shows selecting text and having it highlighted on the PDF. Does the reverse work? I'm thinking of a scenario where the user is presented with a PDF and selects some text, and the app needs to figure out which text-based enrichment it belongs to.

Not yet, but I agree that it has to be considered to make this work as an editor, too. I'm thinking of adding a selection handler such as onSelect({ field, fieldIndex, span }) to the component.
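
A rough sketch of what such a handler's types could look like. Only the handler name `onSelect` and the property names `field`, `fieldIndex`, and `span` come from the comment above; the property types, the `SelectionEvent` and `OnSelect` names, and the `describeSelection` helper are assumptions for illustration.

```typescript
// Hypothetical payload; the property types are assumptions.
interface SelectionEvent {
  field: string; // field name, e.g. 'text'
  fieldIndex: number; // index within the field, e.g. 0 for text[0]
  span: [number, number]; // assumed [begin, end] character offsets
}

// Hypothetical handler type for the proposed onSelect prop.
type OnSelect = (selection: SelectionEvent) => void;

// Hypothetical helper that formats a selection for logging.
function describeSelection({ field, fieldIndex, span }: SelectionEvent): string {
  const [begin, end] = span;
  return `${field}[${fieldIndex}] ${begin}-${end}`;
}

// Example consumer: records each selection it receives.
const received: string[] = [];
const onSelect: OnSelect = (selection) => {
  received.push(describeSelection(selection));
};
```

A payload of this shape would let the app map a selection made on the PDF back to a span in a specific field, which is the reverse direction asked about above.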

Have you tested scanned (non-programmatic) PDFs yet, which have gone through OCR? I'm wondering how well this code works in those cases.

I tried a scanned document. Currently no highlight is shown on scanned documents, and I need to fix that. On scanned PDFs, highlighting relies on the bboxes in the HTML field. With the document I tried, the bboxes span multiple lines and the highlights were not accurate. Smaller bboxes are required for better accuracy.

@fukudasjp fukudasjp merged commit 5e06d62 into master Dec 9, 2021
@fukudasjp fukudasjp deleted the feat/highligh-on-pdf branch December 9, 2021 05:12
jhpedemonte added a commit that referenced this pull request Jan 20, 2022
* origin/master:
  chore: publish v1.5.0-beta.10 [ci skip]
  build: support Node 16 (#268)
  chore: publish v1.5.0-beta.9 [ci skip]
  chore: create a global rollup cmd (#262)
  chore: publish v1.5.0-beta.8 [ci skip]
  feat: document provider interface (#249)
  fix: mitigate overlapped or segmented highlight on PDF (#252)
  chore: publish v1.5.0-beta.7 [ci skip]
  fix: fix PDF highlight misalignment issue (#253)
  chore: publish v1.5.0-beta.6 [ci skip]
  chore: update Carbon to 10.46.0/7.46.0 (#254)
  chore: update Carbon to 10.46.0/7.46.0 (#250)
  chore: publish v1.5.0-beta.5 [ci skip]
  feat: add PDF viewer with highlighting (#238)