PhraseMatcher Incorrectly Matching String Twice with Single Char Difference #5300
I am creating a `PhraseMatcher` with several rules. I came across a case where rules with different patterns matched the same string. Could this be a hash collision?

How to reproduce the behaviour

    import spacy

    nlp = spacy.load("en_core_web_md", vectors=False)
    matcher = spacy.matcher.PhraseMatcher(nlp.vocab)
    # Rule '0' matches the phrase 'lgmd1'
    patterns = [nlp.make_doc('lgmd1')]
    matcher.add(str('0'), None, *patterns)
    # Rule '1' matches the phrase 'lgmd1g'
    patterns = [nlp.make_doc('lgmd1g')]
    matcher.add(str('1'), None, *patterns)
    print(matcher(nlp('lgmd1g')))

Produces:

    [(746762829127501960, 0, 1), (5533571732986600803, 0, 2)]
Replies: 1 comment
This is correctly showing one match for each pattern. I think the confusion is related to the tokenization. The default English tokenizer tokenizes `'lgmd1g'` into two tokens, `'lgmd1'` and `'g'`.

When the `PhraseMatcher` searches the document `nlp('lgmd1g')`, the first pattern matches the first token (the span with document slice `[0:1]` in the first match) and the second pattern matches a phrase containing two tokens (`[0:2]` in the second match).
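To make the two spans concrete, here is a minimal pure-Python sketch of phrase matching over a token list. It is a toy stand-in, not spaCy's implementation: `find_phrase_matches` is a hypothetical helper, the rule names are used directly instead of hashed match IDs, and the token split `['lgmd1', 'g']` is assumed from the explanation above.

```python
# Toy stand-in for phrase matching over tokens (NOT spaCy's implementation).
# Assumption: the tokenizer splits 'lgmd1g' into the tokens ['lgmd1', 'g'].

def find_phrase_matches(doc_tokens, patterns):
    """Return (key, start, end) for every token span that equals a pattern."""
    matches = []
    for key, pattern_tokens in patterns.items():
        n = len(pattern_tokens)
        # Slide a window of the pattern's length over the document tokens.
        for start in range(len(doc_tokens) - n + 1):
            if doc_tokens[start:start + n] == pattern_tokens:
                matches.append((key, start, start + n))
    return matches


doc_tokens = ["lgmd1", "g"]      # tokens of the document 'lgmd1g'
patterns = {
    "0": ["lgmd1"],              # pattern 'lgmd1' is a single token
    "1": ["lgmd1", "g"],         # pattern 'lgmd1g' is two tokens
}

matches = find_phrase_matches(doc_tokens, patterns)
print(matches)  # [('0', 0, 1), ('1', 0, 2)]
```

Both rules fire because they match different token spans that happen to start at the same token, so no hash collision is involved. In spaCy itself, printing `[t.text for t in nlp('lgmd1g')]` shows the actual token split.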