How to use lemmatizers in languages without packages #11047

zafercavdar · 2022-06-28T19:35:55Z

Hi,

Some languages don't have trained pipeline packages, but they do have tokenizers, lemmatization rules and stop words. Irish is one of them. I tried to use the Irish lemmatizer like this:

from spacy.lang.ga import Irish

nlp = Irish()
print([token.lemma_ for token in nlp("Is mise an t-aon duine")])

It prints:
['', '', '', '', '', '', '']

Other lemmatizers generally return the word itself when it doesn't exist in the lemma lookup table.

Am I missing something here?

Answered by polm

polm · 2022-06-29T05:08:35Z

There are two issues here.

The general issue is that calling Irish() (or the equivalent class for another language) gives you a blank pipeline, and the lemmatizer is a component you have to add yourself. You can add it like this:

from spacy.lang.ga import Irish

nlp = Irish()  # blank pipeline: tokenizer only, no components yet
lemmatizer = nlp.add_pipe("lemmatizer")
lemmatizer.initialize()  # load the lemma lookup tables
print([token.lemma_ for token in nlp("Is mise an t-aon duine")])

That covers the general case.

The second issue is specific to the Irish lemmatizer: based on the implementation, it doesn't seem to do anything without part-of-speech information. We don't have a trained tagger for Irish, so you'd have to supply your own tagger for it to work. Otherwise the lemma will always be the lowercased form of the word. I think other lemmatizers usually use a fallback strategy in the absence of POS data.
