Lemmatization is not working for Chinese language #10386
-
Chinese doesn't have a lemmatizer directly in spaCy. The change in lemmas you've seen is probably due to changes in how segmentation works between spaCy v2 and v3; see the Chinese support notes for details. I think if you use jieba you should be able to get lemmas, as their documentation indicates they have functionality for it. That might not work as well with the pretrained pipelines though, which use a special pkuseg model for compatibility with the training data.
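As a sketch of that jieba route, switching the Chinese tokenizer to jieba segmentation in v3 is a config option (this follows the documented `segmenter` setting for `spacy.lang.zh.Chinese`; the `jieba` package must be installed separately):

```python
from spacy.lang.zh import Chinese

# Configure the Chinese tokenizer to use jieba for word segmentation
# instead of the default character segmentation.
cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})

doc = nlp("我喜欢自然语言处理")
print([token.text for token in doc])
```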
Chinese and Japanese both use external tokenizers. In Japanese, SudachiPy provides lemmas along with tokenization. Beyond that, the two languages have little in common in how spaCy handles them or in their lemmatization needs. Japanese has significant inflection for several important word classes. I don't speak Chinese, but I understand inflection is rare, which may explain the articles you read; that wouldn't apply to Japanese.
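You can see this in a Japanese pipeline, where lemmas come straight from the tokenizer (a minimal example, assuming the `ja_core_news_sm` pipeline is installed):

```python
import spacy

# Japanese pipelines tokenize with SudachiPy, which supplies lemmas
# as part of its morphological analysis.
nlp = spacy.load("ja_core_news_sm")

doc = nlp("勉強しています")
for token in doc:
    print(token.text, token.lemma_)  # e.g. し -> する, い -> いる
```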
-
I think the main difference between v2.3 and v3 here is the built-in lemma backoff. In v2.3 you would get `token.text` back as `token.lemma_` for tokens without a lemma, while in v3 the lemma defaults to an empty string. If you want, you can write a simple pipeline component that copies `token.text` to `token.lemma_` for tokens where no lemma is set. As far as I know there is no built-in lemmatizer for Chinese in spaCy.
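A minimal sketch of such a component might look like this (the component name `lemma_backoff` and the `zh_core_web_sm` pipeline are illustrative choices, not part of spaCy itself):

```python
import spacy
from spacy.language import Language

@Language.component("lemma_backoff")
def lemma_backoff(doc):
    # Fall back to the surface form when no lemma is set,
    # mimicking the v2.3 backoff behavior described above.
    for token in doc:
        if not token.lemma_:
            token.lemma_ = token.text
    return doc

nlp = spacy.load("zh_core_web_sm")
nlp.add_pipe("lemma_backoff")

doc = nlp("我喜欢自然语言处理")
print([token.lemma_ for token in doc])
```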
-
How to reproduce the behaviour
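A minimal script along these lines reproduces it (the pipeline name and input text here are assumptions, since the original snippet is not quoted):

```python
import spacy

# Any v3 Chinese pipeline shows the same behavior.
nlp = spacy.load("zh_core_web_sm")

doc = nlp("我喜欢自然语言处理")
for token in doc:
    print(token.lemma_)  # prints empty strings in v3
```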
The above program produces empty output.
I came across some articles saying that lemmatization is not required for Chinese. I would guess Japanese falls into the same category as Chinese, yet lemmatization works properly for Japanese in spaCy.
The same code was working earlier with spaCy version 2.3.2.
Your Environment