Odd token matching lemma attribute behavior #5343

ingridgoh · 2020-04-23T08:43:04Z

How to reproduce the behaviour

When keywords are defined with the LEMMA attribute, the matcher does not pick out the word "Liability" in the sample sentence below.

The match only works after one of the following changes to the sample sentence or the pattern:

When the word "Proxy" is changed to a lowercase "proxy", the keyword "Liability" is identified.
When the word "Liability" in the sample sentence is changed to a lowercase "liability", the keyword "Liability" is identified.
When [{"LEMMA": "liability"}] is switched to [{"LEMMA": "Liability"}], the keyword "Liability" is identified.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# One single-token pattern per keyword lemma
pattern = [
    [{"LEMMA": "liability"}],
    [{"LEMMA": "liable"}]
]

# spaCy 2.x signature: add(key, on_match, *patterns)
matcher.add('TERMS', None, *pattern)

textLine = nlp(", instructs the redemption of  certain funds and initial allocations and amends the Liability Proxy and Hedge Benchmark  and adjust the LDI Assets.")
matches = matcher(textLine)

# Print the text of every matched span
for match_id, start, end in matches:
    span = textLine[start:end]
    print(span.text)

Your Environment

Operating System: Windows 10
Python Version Used: 3.7.1
spaCy Version Used: 2.2.3
Environment Information: -
spaCy model: en-core-web-sm v2.2.5 and v2.2.0

Answered by adrianeboyd

pcoenen · 2020-04-23T09:35:39Z

I experienced similar errors :/

adrianeboyd · 2020-04-23T11:28:50Z

The lemmas depend on the POS, so whether the tagger thinks a word is a common noun (NOUN) or proper noun (PROPN) makes a difference in the lemmatization. The lemmas of proper nouns are left unchanged, while the common nouns will be lowercased and converted to singular.

The provided models are more likely to tag capitalized words as proper nouns, but they may also tag a lowercased word like london as a proper noun, because the training data is augmented with some lowercased text to improve the results for more informal texts.

Examine the doc for the sentence and you can hopefully track down why particular words aren't matching:

for t in textLine:
    print(t.text, t.pos_, t.lemma_)

...
certain ADJ certain
funds NOUN fund
and CCONJ and
initial ADJ initial
allocations NOUN allocation
and CCONJ and
amends VERB amend
the DET the
Liability PROPN Liability
Proxy PROPN Proxy
and CCONJ and
Hedge PROPN Hedge
Benchmark PROPN Benchmark
  SPACE  
and CCONJ and
adjust VERB adjust
the DET the
LDI PROPN LDI
Assets PROPN Assets
. PUNCT .

The proper noun vs. common noun distinction is pretty difficult for the tagger, so it may be hard to get the results you want using LEMMA. If you're only searching for Liability and not Liabilities, then LOWER might be a better attribute to match on than LEMMA.
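To make the difference concrete, the comparison can be replayed in plain Python on a few rows of the tagger output printed above. This is only a sketch: the `hits` helper is hypothetical and imitates an exact attribute comparison, it does not call spaCy's Matcher.

```python
# (text, pos, lemma) rows copied from the tagger output above.
tagged = [
    ("allocations", "NOUN", "allocation"),
    ("Liability", "PROPN", "Liability"),
    ("Proxy", "PROPN", "Proxy"),
    ("Assets", "PROPN", "Assets"),
]


def hits(rows, attr, value):
    """Return the token texts whose chosen attribute equals value exactly.

    Hypothetical helper: "LEMMA" compares the lemma column, anything else
    is treated as "LOWER" and compares the lowercased token text.
    """
    out = []
    for text, pos, lemma in rows:
        token_value = lemma if attr == "LEMMA" else text.lower()
        if token_value == value:
            out.append(text)
    return out


lemma_hits = hits(tagged, "LEMMA", "liability")
lower_hits = hits(tagged, "LOWER", "liability")
print(lemma_hits)  # []
print(lower_hits)  # ['Liability']
```

Because "Liability" was tagged PROPN, its lemma keeps the capital letter and the lowercase LEMMA pattern never sees an equal value, while the LOWER comparison is independent of the tag.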

ingridgoh (Author) · 2020-04-28T05:35:53Z

Unfortunately we will need to search for all variations of the word. Is there a workaround for this? The only approach I can think of for now is to define two separate patterns:

...
"liability": [{"LEMMA": "liability"}],
"Liability": [{"LEMMA": "Liability"}]
...

However, this would probably make matching slower, and we have more than 60 terms to match.
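One way to avoid writing both casings by hand for 60+ terms is to generate the patterns. The sketch below is an assumption-laden illustration, not a confirmed fix from this thread: `build_patterns` is a hypothetical helper, it assumes single-word terms whose proper-noun lemma is simply the capitalized form, and it relies on the `IN` set operator from the Matcher's extended pattern syntax (available since spaCy v2.1).

```python
def build_patterns(terms):
    """Build one single-token pattern per term that accepts either casing.

    Hypothetical helper: each pattern matches a token whose lemma is the
    lowercase term (common-noun reading) or the capitalized term
    (proper-noun reading, where the lemma is left unchanged).
    """
    patterns = []
    for term in terms:
        variants = sorted({term.lower(), term.capitalize()})
        patterns.append([{"LEMMA": {"IN": variants}}])
    return patterns


terms = ["liability", "liable"]
patterns = build_patterns(terms)
print(patterns[0])
```

With spaCy 2.2 the result could then be registered in one call, for example `matcher.add('TERMS', None, *patterns)`. A capitalized plural such as "Liabilities" that gets tagged PROPN would still keep its surface form as lemma, so it would need its own variant.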

svlandeg · 2020-10-16T13:17:30Z

As Adriane pointed out, you could also use LOWER as the attribute to match on, though that will miss other forms of the term such as "liabilities". I'm afraid a rule-based system like this will always have its disadvantages, and it requires some tuning and experimentation.

Given Adriane's explanation of how the tagger results influence the lemmatization, and the fact that there is not really an action point left for us, I will tentatively close this issue. Let us know if you experience further issues that can't be attributed to different tagging results though!
