PhraseMatcher not matching correctly on attr when tokenization is customized #11951

NixBiks · 2022-12-08T10:18:54Z

I have an example where $ is added to the tokenizer's infix rules. With that change, the PhraseMatcher fails to match on the LOWER attribute.

How to reproduce the behaviour

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import compile_infix_regex

nlp = spacy.load("blank:en")
nlp.tokenizer.infix_finditer = compile_infix_regex(
    list(nlp.Defaults.infixes) + [r"[$]"]
).finditer

doc = nlp("It amounted to US$ 5")

# This works: the pattern text has the same case as the text in the doc
matcher_working = PhraseMatcher(nlp.vocab, attr="LOWER", validate=True)
matcher_working.add("USD", [nlp("US$")])
assert len(matcher_working(doc)) == 1

# This does not work: the pattern text is lowercase, and the assertion below fails
matcher = PhraseMatcher(nlp.vocab, attr="LOWER", validate=True)
matcher.add("USD", [nlp("us$")])
assert len(matcher(doc)) == 1

If I don't add [r"[$]"] to the infixes, both matchers work fine. I assume that's a bug?

Info about spaCy

spaCy version: 3.4.1
Platform: macOS-12.6.1-arm64-arm-64bit
Python version: 3.9.9

adrianeboyd · 2022-12-08T12:47:16Z

Uppercase and lowercase text can be tokenized differently, so the provided pattern "US$" isn't tokenized the same way as "us$", and the pattern tokens then don't line up with the doc tokens when the matcher tries to match. A similar issue is #6994.
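
To see the mismatch directly, it can help to print the tokens for both spellings. The following is a minimal sketch that reuses the setup from the report; the variable names upper_tokens and lower_tokens are only illustrative, and the exact splits may depend on the spaCy version and tokenizer settings.

```python
import spacy
from spacy.util import compile_infix_regex

# Same setup as in the report: blank English pipeline with "$" added as an infix.
nlp = spacy.load("blank:en")
nlp.tokenizer.infix_finditer = compile_infix_regex(
    list(nlp.Defaults.infixes) + [r"[$]"]
).finditer

# Tokenize both spellings on their own, exactly as the patterns are created.
upper_tokens = [t.text for t in nlp("US$")]
lower_tokens = [t.text for t in nlp("us$")]
print(upper_tokens)
print(lower_tokens)

# Tokenizer.explain reports which tokenizer rule produced each token,
# which shows where the two spellings diverge.
print(nlp.tokenizer.explain("US$"))
print(nlp.tokenizer.explain("us$"))
```

If the two lists have a different number of tokens, a LOWER pattern built from one spelling cannot line up with a doc that contains the other.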

There is potential for even more variation in tokenization (especially with custom tokenizer settings), but it's probably sufficient for typical English settings/cases to add both nlp(text) and nlp(text.lower()) to your pattern to cover both possible tokenizations.
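
A minimal sketch of that workaround, assuming the same custom infix setup as in the report. The helper add_case_variants is hypothetical (it is not part of spaCy); it simply adds nlp(text) and nlp(text.lower()) under one key.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import compile_infix_regex

# Same setup as in the report: blank English pipeline with "$" added as an infix.
nlp = spacy.load("blank:en")
nlp.tokenizer.infix_finditer = compile_infix_regex(
    list(nlp.Defaults.infixes) + [r"[$]"]
).finditer


def add_case_variants(matcher, nlp, key, texts):
    """Hypothetical helper: add each text in its original and lowercased form."""
    patterns = []
    for text in texts:
        patterns.append(nlp(text))
        if text.lower() != text:
            # The lowercased text may be tokenized differently, so add it too.
            patterns.append(nlp(text.lower()))
    matcher.add(key, patterns)


matcher = PhraseMatcher(nlp.vocab, attr="LOWER", validate=True)
add_case_variants(matcher, nlp, "USD", ["US$"])

# One doc with the original casing, one fully lowercased.
matches_original = matcher(nlp("It amounted to US$ 5"))
matches_lower = matcher(nlp("it amounted to us$ 5"))
print(len(matches_original), len(matches_lower))
```

Note that the pattern text is written as "US$", the way it appears in real text. Starting from "us$" would only produce the lowercase tokenization, which is the failing case from the report.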
