How to use lemmatizers in languages without packages #11047
Hi, some languages don't have trained models, but they do have tokenizers, lemmatization rules, and stop words. Irish is one of them. I tried to use the Irish lemmatizer like this:
it prints
Other lemmatizers generally return the word unchanged when it isn't in the lemma lookup table. Am I missing something here?
Replies: 1 comment
There are two issues here.
The general issue you're asking about is that when you call Irish() (or the equivalent class for another language) you get a blank pipeline. The lemmatizer is a component you have to add to the pipeline yourself. That covers the general case.
The second issue, which is specific to the Irish lemmatizer, is that, judging by the implementation, it doesn't do anything without part-of-speech information. We don't have a trained tagger for Irish, so you'd have to supply your own for it to work. Otherwise the lemma will always be the lowercased form of the word. I think other lemmatizers usually use a fallback strategy in the absence of POS data.
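To illustrate the first point, here is a minimal sketch of adding a lemmatizer to a blank Irish pipeline. It is not the snippet from the original reply, which did not survive. Because the default Irish mode depends on POS tags, the sketch switches to the generic "lookup" mode and fills it with a one-entry toy table. The table contents and the example words are assumptions made up for illustration, not spaCy's real lemma data.

```python
from spacy.lang.ga import Irish
from spacy.lookups import Lookups

# Irish() builds a blank pipeline: a tokenizer and no components.
nlp = Irish()

# The lemmatizer has to be added explicitly.
# "lookup" mode maps token text to a lemma and needs no POS tags.
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Toy lookup table (hypothetical data, for illustration only).
lookups = Lookups()
lookups.add_table("lemma_lookup", {"madraí": "madra"})
lemmatizer.initialize(lookups=lookups)

doc = nlp("madraí móra")
lemmas = [token.lemma_ for token in doc]
print(lemmas)
```

Words missing from the table come back unchanged, which matches the behaviour described in the question for other lookup-based lemmatizers. To use the POS-based Irish mode instead, the tokens would need POS tags from a tagger you supply yourself, as noted above.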