German lemmatizer based on outdated spelling rules #8695
-
Yes, the current German lemmatizer is not particularly good. It's a really simple lookup lemmatizer, which may only have entries for some words based on older sources from before the spelling reform. (I was even a bit surprised that there are any entries for forms of [...].) In this case, it might be useful to extend the default lookup lemmatizer for German so that it also looks for spelling variants, since they wouldn't be hard to generate and this wouldn't require extending the underlying table with a lot of near-duplicate entries.
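To make the spelling-variant idea concrete, here is a minimal standalone sketch. It does not use spaCy's actual lemmatizer code: the `LOOKUP` dict and the `lookup_lemma` function are hypothetical names made up for illustration, and the table holds a single entry chosen to mirror the example from this thread. It also assumes a naive rule that swaps every "ss" for "ß" (and vice versa), which would over-generate for many words, so treat it as a starting point rather than a finished rule set.

```python
from typing import Dict

# Hypothetical miniature lookup table with a single pre-reform entry.
# The real table is much larger; its contents are not shown in this thread.
LOOKUP: Dict[str, str] = {
    "verantwortungsbewußter": "verantwortungsbewußt",
}


def lookup_lemma(word: str, table: Dict[str, str] = LOOKUP) -> str:
    """Look up a lemma, falling back to a naive ss/ß spelling variant."""
    # 1. Exact match in the table.
    if word in table:
        return table[word]
    # 2. Post-reform "ss" form: try the pre-reform "ß" variant and map
    #    the lemma back to the spelling used in the input.
    if "ss" in word:
        lemma = table.get(word.replace("ss", "ß"))
        if lemma is not None:
            return lemma.replace("ß", "ss")
    # 3. Pre-reform "ß" form: try the "ss" variant, same idea in reverse.
    if "ß" in word:
        lemma = table.get(word.replace("ß", "ss"))
        if lemma is not None:
            return lemma.replace("ss", "ß")
    # 4. No entry found: return the word unchanged.
    return word


print(lookup_lemma("verantwortungsbewusster"))
print(lookup_lemma("verantwortungsbewußter"))
```

The point of generating variants at lookup time is that the table itself stays untouched, so no near-duplicate entries are needed for the two spellings.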
-
Hi! The German lemmatizer also seems to have problems when a verb is written with an initial uppercase letter (which can happen quite often). For example (tested with spaCy version 3.1):
If I understood correctly, to make this work, an entry like this:
-
How to reproduce the behaviour
import spacy

nlp = spacy.load("de_core_news_sm")
text = "Ein verantwortungsbewusster und verantwortungsbewußter Mann."
doc = nlp(text)
adjectives = [token.lemma_ for token in doc if token.pos_ == "ADJ"]
print(adjectives)
The lemmatizer does not lemmatize "verantwortungsbewusster" correctly, even though the spelling with "ss" has been the only correct spelling since the spelling reform of 1996.
Environment
spaCy version: 3.0.6
Platform: Windows-10-10.0.19041-SP0
Python version: 3.9.5
Pipelines: de_core_news_sm (3.0.0)