@shreeMahadikGit

Fixes this issue: #20225

Contributor

@calixteman left a comment

Unfortunately, it won't work with a PDF containing "Hello. World" and a query equal to "o. w" (FYI, it works in Acrobat).
I think a fix could be to add optional whitespace around each group of punctuation signs, which means there is no [ ]* between consecutive punctuation signs.
That would require updating SPECIAL_CHARS_REG_EXP, but you have to take care of the . and ? cases.
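
To make the suggestion above concrete, here is a minimal sketch of the idea: a whole run of punctuation is escaped and wrapped once in optional spaces, so nothing is inserted between consecutive punctuation signs. The names `escapeLiteral` and `queryToRegExp` and the simplified pattern are assumptions for illustration only, not the pdf.js code.

```javascript
// Hypothetical sketch, not the pdf.js implementation.

// Escape RegExp syntax characters so they match literally.
function escapeLiteral(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a case-insensitive pattern from a plain-text query.
function queryToRegExp(query) {
  const source = query.replace(
    /(\p{P}+)|(\s+)|([+^$|])/gu,
    (match, punctRun, whitespace, meta) => {
      if (punctRun) {
        // One run of punctuation: escape it as a whole and allow
        // optional spaces around the run, not inside it.
        return `[ ]*${escapeLiteral(punctRun)}[ ]*`;
      }
      if (whitespace) {
        // Any whitespace in the query matches one or more spaces.
        return "[ ]+";
      }
      // Remaining metacharacters that are not Unicode punctuation.
      return `\\${meta}`;
    }
  );
  return new RegExp(source, "iu");
}

const found = "Hello. World".match(queryToRegExp("o. w"));
console.log(found && found[0]); // "o. W"
```

With this shape the query "o. w" becomes `o[ ]*\.[ ]*[ ]+w`, and the regex engine backtracks so that the trailing `[ ]*` gives the single space back to `[ ]+`.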

@shreeMahadikGit
Author

shreeMahadikGit commented Oct 16, 2025

Handled the suggested test case. Also attaching the PDF with the test case:

Morse (1) (1).pdf

const DIACRITICS_REG_EXP = /\p{M}+/gu;
const SPECIAL_CHARS_REG_EXP =
- /([.*+?^${}()|[\]\\])|(\p{P})|(\s+)|(\p{M})|(\p{L})/gu;
+ /([*+^${}()|[\]\\])|((?:[.?]|\p{P})+)|(\s+)|(\p{M})|(\p{L})/gu;
Contributor

Why don't you want to capture . and ? here?

- return `[ ]*\\${p1}[ ]*`;
+ // Escaped metacharacters like . * + ? ...
+ // Allow spaces around them ONLY if the user typed spaces.
+ return queryHasWhitespace ? `[ ]*\\${p1}[ ]*` : `\\${p1}`;
Contributor

Why do you want to do that?

- return `[ ]*${p2}[ ]*`;
+ // Punctuation: allow optional spaces ONLY if the user typed spaces.
+ // p2 is a *run* of punctuation; escape it as a whole.
+ const escapedRun = escapeForRegex(p2);
Contributor

You just have to replace the . and the ? with \. and \?
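
As a small illustration of why . and ? need a backslash before they reach the pattern: left unescaped, . matches any character and ? makes the previous token optional. The body of `escapeForRegex` is not shown in this thread, so the version below is an assumed implementation used only for the demonstration.

```javascript
// Assumed implementation; the PR's own escapeForRegex is not shown here.
function escapeForRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const rawDot = new RegExp("o.w", "i");
const escapedDot = new RegExp(escapeForRegex("o.w"), "i");

console.log(rawDot.test("no way"));      // true: "." matched the space
console.log(escapedDot.test("no way"));  // false: a literal "." is required
console.log(escapedDot.test("foo.who")); // true

const rawQuestion = new RegExp("a?");
const escapedQuestion = new RegExp(escapeForRegex("a?"));

console.log(rawQuestion.test("xyz"));     // true: "a?" also matches the empty string
console.log(escapedQuestion.test("xyz")); // false
console.log(escapedQuestion.test("a?"));  // true
```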

+ // Punctuation: allow optional spaces ONLY if the user typed spaces.
+ // p2 is a *run* of punctuation; escape it as a whole.
+ const escapedRun = escapeForRegex(p2);
+ return queryHasWhitespace ? `[ ]*${escapedRun}[ ]*` : `${escapedRun}`;
Contributor

Again, I don't understand why you want to distinguish between hasWhitespace and !hasWhitespace.
