PhraseMatcher Incorrectly Matching String Twice with Single Char Difference #5300

fh-nfer · 2020-04-13T18:59:36Z

I am creating a PhraseMatcher with several rules. I came across a case where two rules with different patterns both matched the same string. Could this be a hash collision?

How to reproduce the behaviour

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_md", vectors=False)

matcher = PhraseMatcher(nlp.vocab)

# Rule '0': the pattern 'lgmd1'
patterns = [nlp.make_doc('lgmd1')]
matcher.add('0', None, *patterns)

# Rule '1': the pattern 'lgmd1g'
patterns = [nlp.make_doc('lgmd1g')]
matcher.add('1', None, *patterns)

print(matcher(nlp('lgmd1g')))

Produces: [(746762829127501960, 0, 1), (5533571732986600803, 0, 2)]

Your Environment

Operating System: OSX Mojave 10.14.6
Python Version Used: 3.8.1
spaCy Version Used: 2.2.3
Environment Information: conda env

Answered by adrianeboyd

adrianeboyd · 2020-04-14T07:36:17Z

This is correctly showing one match for each pattern. I think the confusion is related to the tokenization. The default English tokenizer tokenizes 'lgmd1g' into two tokens:

['lgmd1', 'g']

When the PhraseMatcher searches the document nlp('lgmd1g'), the first pattern matches the first token (span with document slice [0:1] in the first match) and the second pattern matches a phrase containing two tokens ([0:2] in the second match).
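
To see this for yourself, it can help to print the tokens and resolve each match ID back to its rule name. The following is a minimal sketch, not the exact setup from the question: it assumes a blank English pipeline (`spacy.blank("en")`) tokenizes the same way as `en_core_web_md`, since the tokenizer rules come from the language defaults, and it uses the list-style `matcher.add()` call. The helper name `find_matches` is made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline is enough here, because the tokenizer
# rules come from the language defaults rather than from en_core_web_md.
nlp = spacy.blank("en")


def find_matches(text):
    """Return the tokens of `text` and every PhraseMatcher hit as
    (rule name, start token, end token, matched text)."""
    matcher = PhraseMatcher(nlp.vocab)
    # List-style add(); on spaCy older than 2.2.2 use
    # matcher.add("0", None, nlp.make_doc("lgmd1")) instead.
    matcher.add("0", [nlp.make_doc("lgmd1")])
    matcher.add("1", [nlp.make_doc("lgmd1g")])

    doc = nlp(text)
    tokens = [token.text for token in doc]
    found = [
        # Look up the hashed match ID in the string store to get the rule name.
        (nlp.vocab.strings[match_id], start, end, doc[start:end].text)
        for match_id, start, end in matcher(doc)
    ]
    return tokens, found


tokens, found = find_matches("lgmd1g")
print(tokens)
for rule, start, end, span_text in found:
    print(rule, (start, end), span_text)
```

Each rule should show up once, with its own token span, which is what the two tuples in the reported output mean. If only the longest of the overlapping matches is wanted, one option is to pass the matched spans through `spacy.util.filter_spans`.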
