Understanding Lemmatisation #9149
-
Hello guys, I am trying to do a downstream task that requires lemmatisation as a preprocessing step. I tried spaCy's lemmatizer but was a little confused about how the lemmatisation actually happens. As you can see in the above image, I expect some words like Thanks, …
Replies: 2 comments
-
Please don't use screenshots of code; copy/paste it as text. If you need to set custom lemmas, the easiest way is to add a mapping/exception to an AttributeRuler. We could make the docs about this clearer, though it's a little hard because the AttributeRuler is more general than the lemmatizer but doesn't cover all of its use cases (such as rule-based rather than lookup lemmatization).
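For reference, a minimal sketch of the AttributeRuler approach, assuming spaCy v3. It uses a blank English pipeline so no trained model is needed, and the `detailings` to `detail` mapping is a hypothetical exception chosen only for illustration, not a recommended lemma.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components required.
nlp = spacy.blank("en")

# The AttributeRuler sets token attributes on tokens matched by
# Matcher-style patterns.
ruler = nlp.add_pipe("attribute_ruler")

# Hypothetical exception: force the lemma of "detailings" to "detail".
ruler.add(patterns=[[{"LOWER": "detailings"}]], attrs={"LEMMA": "detail"})

doc = nlp("The detailings were finished yesterday.")
custom_lemmas = {token.text: token.lemma_ for token in doc}
print(custom_lemmas["detailings"])
```

In a trained pipeline that already contains an `attribute_ruler` component, you would probably fetch it with `nlp.get_pipe("attribute_ruler")` and add the exception there instead of adding a second ruler.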
-
To clarify about lemmatization: the lemmatizer is intended to remove inflectional morphology, not derivational morphology. Both `summery` and `detailings` are cases with derivational morphology rather than inflectional morphology. (`detailings` is a little more complex: the `-s` is inflectional but the `-ing` is derivational, since `detailing` is a noun in this context.)

Also be aware that the English rule-based lemmatizer uses the part-of-speech to determine how to analyze the word, so the results for individual words without context may not be particularly good. Even with context, the part-of-speech of a word can sometimes be ambiguous, and the tagger will also make some mistakes, but in general the lemmatizer is meant to be applied to phrases or sentences rather than individual words.
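To see how the part-of-speech affects the result, it can help to print the tag next to the lemma for a word in a sentence and for the same word on its own. The helper below is a small sketch; `lemmas_with_pos` is a hypothetical name, and the commented usage assumes the trained `en_core_web_sm` pipeline is installed.

```python
import spacy


def lemmas_with_pos(nlp, text):
    """Return (text, coarse POS, lemma) for every token in `text`."""
    doc = nlp(text)
    return [(token.text, token.pos_, token.lemma_) for token in doc]


# Usage, assuming the trained English pipeline en_core_web_sm is installed:
#   nlp = spacy.load("en_core_web_sm")
#   print(lemmas_with_pos(nlp, "The detailings were finished yesterday."))
#   print(lemmas_with_pos(nlp, "detailings"))
```

Comparing the two calls shows which part-of-speech the tagger assigned in each case, which is the information the rule-based lemmatizer relies on.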