German lemmatizer based on outdated spelling rules #8695

eigenvektorin · 2021-07-09T09:13:36Z

How to reproduce the behaviour

import spacy

nlp = spacy.load("de_core_news_sm")
s = 'Ein verantwortungsbewusster und verantwortungsbewußter Mann.'
s = nlp(s)
adjectives = [word.lemma_ for word in s if word.pos_ == "ADJ"]
print(adjectives)

The lemmatizer is not able to lemmatize "verantwortungsbewusster" correctly, even though the spelling with "ss" has been the only correct spelling since the spelling reform of 1996.

Environment

spaCy version: 3.0.6
Platform: Windows-10-10.0.19041-SP0
Python version: 3.9.5
Pipelines: de_core_news_sm (3.0.0)

adrianeboyd · 2021-07-12T09:26:15Z

Yes, the current German lemmatizer is not particularly good. It's a really simple lookup lemmatizer, which may only have entries for some words based on older sources from before the spelling reform. (I was even a bit surprised that there are any entries for forms of verantwortungsbewußt at all.)

In this case, it might be useful to extend the default lookup lemmatizer for German so that it also looks for spelling variants. They wouldn't be hard to generate, and it wouldn't require extending the underlying table with a lot of near-duplicate entries.
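As a rough illustration of the spelling-variant idea, here is a minimal sketch in plain Python. It is not spaCy's implementation: `spelling_variants`, `lookup_with_variants`, and `toy_table` are hypothetical names, the table is a plain dict standing in for the lookup table, and only the "ss" to "ß" direction is covered.

```python
from typing import Dict, Iterator


def spelling_variants(form: str) -> Iterator[str]:
    """Yield the form itself, then each variant with one "ss" replaced by "ß"."""
    yield form
    start = form.find("ss")
    while start != -1:
        yield form[:start] + "ß" + form[start + 2:]
        start = form.find("ss", start + 1)


def lookup_with_variants(table: Dict[str, str], form: str) -> str:
    """Return the lemma of the first variant found in the table, else the form."""
    for candidate in spelling_variants(form):
        if candidate in table:
            return table[candidate]
    return form


# Toy table holding only a pre-reform entry (hypothetical data).
toy_table = {"verantwortungsbewußter": "verantwortungsbewußt"}

print(lookup_with_variants(toy_table, "verantwortungsbewusster"))
```

Note that the lemma comes back in whatever spelling the table stores, so normalizing it to the post-reform spelling would be a separate step.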

0 replies

klemensz · 2021-08-27T10:00:23Z

Hi! The German lemmatizer also seems to have problems if a verb is written with an initial uppercase letter (which can happen quite often).

For example (tested with spaCy version 3.1):

Schaut euch das an! -> here the lemma of "Schaut" is recognized (incorrectly) as "Schaut".
schaut euch das an! -> here the lemma of "schaut" is recognized (correctly) as "schauen".

If I understood correctly, to make this work, an entry like this: "Schaut": "schauen", would need to be added to the German lemma table JSON?

2 replies

adrianeboyd Aug 27, 2021

You mean a verb, right?

You can add new entries for forms that aren't already included like Schaut, but this breaks down for many tokens that could either be nouns or verbs like Schauen [Sie] because there's already an entry for Schauen -> Schau as a noun. The simplest lookup lemmatizer that doesn't know anything about POS is just never going to be very good here.
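To make the ambiguity concrete, here is a toy sketch using a plain dict in place of the real table. The entries, `lookup_lemma`, and the `pos_aware` table are hypothetical and only illustrate the point. They are not spaCy's data or API.

```python
# A POS-agnostic table maps each surface form to exactly one lemma.
lemma_lookup = {
    "schaut": "schauen",
    "Schauen": "Schau",  # noun reading, as in the existing entry mentioned above
}


def lookup_lemma(form: str) -> str:
    """Return the table entry for the form, or the form itself if missing."""
    return lemma_lookup.get(form, form)


# Adding a missing capitalized verb form works fine.
lemma_lookup["Schaut"] = "schauen"
print(lookup_lemma("Schaut"))

# "Schauen" already has the noun lemma, so the verb in "Schauen Sie"
# gets the same answer; the table has no room for a second reading.
print(lookup_lemma("Schauen"))

# Hypothetical alternative: keying on (form, POS) keeps both readings apart,
# but it needs a POS tag for every token at lookup time.
pos_aware = {
    ("Schauen", "NOUN"): "Schau",
    ("Schauen", "VERB"): "schauen",
}
print(pos_aware[("Schauen", "VERB")])
```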

You can modify a loaded pipeline for a lookup lemmatizer by editing this table:

nlp.get_pipe("lemmatizer").lookups.get_table("lemma_lookup")

If you save the pipeline with nlp.to_disk, these changes will be included when it's reloaded with spacy.load, so you don't have to go through the JSON table in spacy-lookups-data to update an existing pipeline. However, if you want to make these changes locally for all new pipelines, then you'd need a custom install of spacy-lookups-data with the changes.

klemensz Aug 27, 2021

Yes, sorry, I meant verb (and corrected it now).

Thanks for the additional info/instructions. What we're doing (so far) is to maintain an additional JSON file in the same format and then add the entries after spacy.load to the lookup table like this (dict_lemma_lookup is the loaded JSON data):

lookup_table = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_lookup")
for key in dict_lemma_lookup:
    lookup_table.set(key, dict_lemma_lookup[key])
