`LIKE_NUM` behavior is inconsistent for English. #10498

edemattos · 2022-03-14T20:04:55Z

I would like to use LIKE_NUM to pick out number expressions before the NER component runs, but I am getting mixed results. Should recovering number expressions like these be left to NER, or should the Matcher be able to handle them?

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

text = [
    "three hundred and sixty five days",
    "fifty days",
    "45,646 days",
    "45, 646 days",
    "3 years is 1,095 days",
]

matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}, {'OP': "+"}, {'LOWER': 'days'}]
matcher.add('num', [pattern], greedy="LONGEST")

for doc in nlp.pipe(text, disable=['ner']):

    print(f"# {doc.text}")

    for token in doc:
        print(f"{token.text}\t{token.tag_}\tlike_num={token.like_num}")

    matches = matcher(doc)

    for m in matches:
        print(f"MATCH: {doc[m[1]:m[2]]}")
    if not matches:
        print("MATCH: none")

    print()

Output:

# three hundred and sixty five days
three	CD	like_num=True
hundred	CD	like_num=True
and	CC	like_num=False
sixty	CD	like_num=True
five	CD	like_num=True
days	NNS	like_num=False
MATCH: three hundred and sixty five days ✅

# fifty days
fifty	CD	like_num=True
days	NNS	like_num=False
MATCH: none ❌

# 45,646 days
45,646	CD	like_num=True
days	NNS	like_num=False
MATCH: none ❌

# 45, 646 days
45	CD	like_num=True
,	,	like_num=False
646	CD	like_num=True
days	NNS	like_num=False
MATCH: 45, 646 days ✅

# 3 years is 1,095 days
3	CD	like_num=True
years	NNS	like_num=False
is	VBZ	like_num=False
1,095	CD	like_num=True
days	NNS	like_num=False
MATCH: 3 years is 1,095 days ❌

Info about spaCy

spaCy version: 3.2.2
Platform: macOS-12.0.1-arm64-arm-64bit
Python version: 3.9.10
Pipelines: en_core_web_sm (3.2.0)

edemattos · 2022-03-14T20:57:03Z

Removing {"OP": "+"} makes the results more consistent, but the pattern can no longer recognize numbers that span several tokens, or digits with a space after the separator, though a case could be made for rejecting the latter.

# three hundred and sixty five days
three	CD	like_num=True
hundred	CD	like_num=True
and	CC	like_num=False
sixty	CD	like_num=True
five	CD	like_num=True
days	NNS	like_num=False
MATCH: five days ❌

# fifty days
fifty	CD	like_num=True
days	NNS	like_num=False
MATCH: fifty days ✅

# 45,646 days
45,646	CD	like_num=True
days	NNS	like_num=False
MATCH: 45,646 days ✅

# 45, 646 days
45	CD	like_num=True
,	,	like_num=False
646	CD	like_num=True
days	NNS	like_num=False
MATCH: 646 days ❌

# 3 years is 1,095 days
3	CD	like_num=True
years	NNS	like_num=False
is	VBZ	like_num=False
1,095	CD	like_num=True
days	NNS	like_num=False
MATCH: 1,095 days ✅

edemattos · 2022-03-15T10:44:14Z

Apologies, I realize now that OP should not have been its own token pattern. The output in the second comment is therefore the expected behavior, and it makes sense from the perspective of identifying strictly like_num tokens, even if it means cutting off larger numbers like three hundred and sixty five.

The docs are pretty clear about how OP works, but I think this is an easy slip-up to make, so it might be worth adding an error check for it. Or, if a token pattern that contains only OP, as in my example above, is allowed and intended to function like .+, maybe that could be clarified in the docs?
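
For reference, here is a minimal sketch of the pattern with OP placed in the same token dict as LIKE_NUM, so the quantifier applies to the number-like token instead of acting as a separate wildcard token. It assumes a blank English pipeline (`spacy.blank("en")`) rather than en_core_web_sm, on the assumption that LIKE_NUM and LOWER are lexical attributes that need no trained components. The helper name `find_matches` is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough here, because LIKE_NUM and
# LOWER are lexical attributes and do not depend on trained components.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
# OP sits in the same dict as LIKE_NUM: one or more number-like tokens,
# followed directly by "days".
pattern = [{"LIKE_NUM": True, "OP": "+"}, {"LOWER": "days"}]
matcher.add("num", [pattern], greedy="LONGEST")


def find_matches(text):
    """Return the matched spans of `text` as strings (hypothetical helper)."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


for text in ["fifty days", "45,646 days", "three hundred and sixty five days"]:
    print(text, "->", find_matches(text))
```

With this pattern a run of adjacent number-like tokens is matched as one unit. Since "and" is not number-like, "three hundred and sixty five days" should still be cut down to "sixty five days".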

adrianeboyd · 2022-03-15T16:01:16Z

That's right, this is the expected behavior for that pattern. You might want to match any token at a certain position ({} is also a valid token dict) or one or more tokens between two other tokens, so I don't think a warning or error makes sense here.
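
To make the two wildcard cases concrete, here is a small sketch. It assumes a blank English pipeline instead of en_core_web_sm, and the helper `find_matches` and the text "three more days" are made up for illustration. The second pattern is the one from the original question.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline suffices for these lexical attributes.
nlp = spacy.blank("en")


def find_matches(pattern, text):
    """Hypothetical helper: run one pattern over `text`, return matched strings."""
    matcher = Matcher(nlp.vocab)
    matcher.add("num", [pattern], greedy="LONGEST")
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


# {} matches exactly one token of any kind.
one_any = [{"LIKE_NUM": True}, {}, {"LOWER": "days"}]
# {"OP": "+"} matches one or more tokens of any kind.
many_any = [{"LIKE_NUM": True}, {"OP": "+"}, {"LOWER": "days"}]

print(find_matches(one_any, "three more days"))
print(find_matches(many_any, "fifty days"))
print(find_matches(many_any, "3 years is 1,095 days"))
```

The second pattern needs at least one token between the number and "days", which is why "fifty days" yields no match while "3 years is 1,095 days" is matched in full.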

2 replies

edemattos Mar 16, 2022
Author

Thanks! {} as a wildcard makes sense but wasn't obvious to me at first, and the doc page makes it seem like patterns must come from the listed attributes.

kinghuang Mar 25, 2022

All these years of using spaCy, and I had no idea I could use {} as a wildcard token! 🤯
