Spacy Matcher doesn't match with "cannot" #10122

bwang482 · 2022-01-24T04:57:33Z

I am copying and testing the simple example from the documentation, and replacing "hello" with "cannot". Now the matcher returns nothing (it worked with "hello").

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "cannot"}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Cannot world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

PhraseMatcher doesn't seem to have a problem with "cannot world", so what is so special about the token "cannot" that Matcher cannot match it? Or am I missing something here?

Pandalei97 · 2022-01-24T07:50:41Z

Hi,

If you take a look at the tokens in the doc, you will see that "cannot" is separated into 2 tokens: 'can' and 'not'.
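A quick way to check this is to print the tokens yourself. This is only a minimal sketch, assuming spaCy v3 and the blank English pipeline used in the question:

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

doc = nlp("Cannot world!")

# Collect the text of every token the tokenizer produced
tokens = [token.text for token in doc]
print(tokens)
```

If "Can" and "not" show up as separate entries, a single-token pattern for "cannot" has nothing to match against.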

Since Matcher patterns are descriptions of tokens to find, your pattern will search for a single token 'cannot', followed by 'world'.

This pattern works in your case: pattern = [{"LOWER": "can"}, {"LOWER": "not"}, {"LOWER": "world"}]
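For completeness, here is the example from the question with the three-token pattern dropped in. It is a sketch assuming spaCy v3; the match ID "CannotWorld" and the variable `matched_texts` are just names picked for illustration:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: "cannot" is tokenized as "can" + "not"
pattern = [{"LOWER": "can"}, {"LOWER": "not"}, {"LOWER": "world"}]
matcher.add("CannotWorld", [pattern])

doc = nlp("Hello, world! Cannot world!")

# Each match is (match_id, start, end); slice the doc to get the span text
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

Note that the span text is still "Cannot world": the split tokens keep their original whitespace information, so the surface text is unchanged.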

The PhraseMatcher works because its patterns are Doc objects produced by the same tokenizer, so they are split in the same way as the text being searched.
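A rough PhraseMatcher equivalent is sketched below, assuming spaCy v3. `attr="LOWER"` is my choice here so that "Cannot" matches the lowercase pattern, and the match ID and variable names are made up for the example:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Compare on the lowercase form so "Cannot" matches "cannot"
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# The pattern is itself a Doc, so it goes through the same tokenizer
pattern_doc = nlp.make_doc("cannot world")
phrase_matcher.add("CannotWorld", [pattern_doc])

doc = nlp("Hello, world! Cannot world!")
phrase_matches = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
print(phrase_matches)
```

Because the pattern Doc is tokenized as "can", "not", "world" too, it lines up with the tokens in the text without any manual splitting.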

When writing patterns for Matcher, you need to pay attention to the tokenization, especially when it comes to compound words and other strings that the tokenizer splits into several tokens.
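One way to avoid guessing at the tokenization is to let the tokenizer build the Matcher pattern for you. The helper below is hypothetical (`pattern_from_text` is not part of spaCy) and only a sketch, assuming spaCy v3:

```python
import spacy
from spacy.matcher import Matcher


def pattern_from_text(nlp, text):
    """Hypothetical helper: one {"LOWER": ...} dict per token in `text`."""
    return [{"LOWER": token.lower_} for token in nlp.make_doc(text)]


nlp = spacy.blank("en")

# Tokenize the phrase first, then turn each token into a pattern entry
pattern = pattern_from_text(nlp, "cannot world")
print(pattern)

matcher = Matcher(nlp.vocab)
matcher.add("CannotWorld", [pattern])

doc = nlp("Hello, world! Cannot world!")
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)
```

This keeps the pattern in sync with whatever the tokenizer does, at the cost of only supporting plain literal phrases.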

bwang482 · 2022-01-24T19:43:14Z

Author

Thanks very much, @Pandalei97! Great point!
