Odd token matching lemma attribute behavior #5343
-
How to reproduce the behaviour
When the lemma attribute is defined for keywords, the following sample sentence does not pick out the word "Liability". The matching only worked after words in the sample sentence or pattern were modified:
Your Environment
Replies: 4 comments
-
I experienced similar errors :/
-
The lemmas depend on the POS, so whether the tagger thinks a word is a common noun (NOUN) or a proper noun (PROPN) makes a difference in the lemmatization. The lemmas of proper nouns are left unchanged, while common nouns are lowercased and converted to singular.
The provided models are more likely to think that capitalized words are proper nouns, but they may also tag words like "london" as a proper noun because the training data is augmented to include some lowercased data to improve the results for more informal texts.
Examine the doc for the sentence and you can hopefully track down why particular words aren't matching:
    # textLine is the processed Doc for the sample sentence
    for t in textLine:
        print(t.text, t.pos_, t.lemma_)
The proper noun vs. common noun distinction is pretty difficult for the tagger, so it may be hard to get the results you want using
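To make the effect concrete, here is a toy sketch in plain Python. It is not spaCy's lemmatizer: `toy_lemma`, `lemma_matches` and the crude singularization rules are hypothetical stand-ins that only mimic the behaviour described above (proper nouns left unchanged, common nouns lowercased and singularized). It shows why a pattern written against the lemma "liability" can miss the token "Liability" when the tagger decides it is a proper noun.

```python
def toy_lemma(text, pos):
    """Hypothetical stand-in for a POS-dependent lemmatizer (not spaCy's code)."""
    if pos == "PROPN":
        # Proper nouns: the lemma is the surface form, case included.
        return text
    lowered = text.lower()
    if pos == "NOUN":
        # Very rough singularization, for illustration only.
        if lowered.endswith("ies"):
            return lowered[:-3] + "y"
        if lowered.endswith("s") and not lowered.endswith("ss"):
            return lowered[:-1]
    return lowered


def lemma_matches(text, pos, pattern_lemma):
    """Mimic a single-token lemma pattern: exact comparison on the lemma."""
    return toy_lemma(text, pos) == pattern_lemma


# Same surface form, different tag, different outcome.
as_noun = lemma_matches("Liability", "NOUN", "liability")
as_propn = lemma_matches("Liability", "PROPN", "liability")
print(as_noun, as_propn)
```

With a real pipeline, the loop above over the doc is the way to see which of the two cases a given token falls into.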
-
Unfortunately, we need to search for all variations of the word. Is there a workaround for this? The only approach I can think of for now is to define two separate patterns:
However, this would probably take longer, and we have 60+ terms to match.
-
As Adriane pointed out, you could also use
Given Adriane's explanation of how the tagger results influence the lemmatization, and the fact that there is no real action point left for us, I will tentatively close this issue. Let us know if you experience further issues that can't be attributed to different tagging results, though!
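For the 60+ terms mentioned earlier in the thread, writing two patterns per term by hand can be avoided by generating them. The sketch below is one possible approach and not necessarily the one suggested above: it builds, for each term, a lemma-based pattern plus a fallback on the lowercased token text. `build_patterns` and the `terms` input format are hypothetical; `LEMMA`, `LOWER` and the `IN` set operator are token pattern keys of spaCy's `Matcher` (the `IN` operator assumes spaCy 2.1 or later). It only covers single-token terms, and the surface variants have to be supplied by you.

```python
def build_patterns(terms):
    """Build Matcher-style token patterns for single-token terms.

    terms: dict mapping a base form to an iterable of surface variants
           (hypothetical input format, e.g. plurals you list yourself).
    Returns a dict mapping each base form to a list of two patterns.
    """
    patterns = {}
    for base, variants in terms.items():
        # Deduplicate and lowercase every form, including the base itself.
        forms = sorted({base.lower(), *(v.lower() for v in variants)})
        patterns[base] = [
            # Matches when the tagger treats the token as a common noun.
            [{"LEMMA": base.lower()}],
            # Fallback on lowercased text, independent of the POS tag.
            [{"LOWER": {"IN": forms}}],
        ]
    return patterns


terms = {"liability": ["liabilities"], "asset": ["assets"]}
patterns = build_patterns(terms)
for base, pats in patterns.items():
    print(base, pats)
```

Each entry can then be registered under its base form as the match ID with `Matcher.add`; the exact call signature differs between spaCy versions, so it is left out here.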
The lemmas depend on the POS, so whether the tagger thinks a word is a common noun (NOUN) or proper noun (PROPN) makes a difference in the lemmatization. The lemmas of proper nouns are left unchanged, while the common nouns will be lowercased and converted to singular.
The provided models are more likely to think that capitalized words are proper nouns, but they may also tag words like "london" as a proper noun because the training data is augmented to include some lowercased data to improve the results for more informal texts.
Examine the doc for the sentence and you can hopefully track down why particular words aren't matching: