Lemmatization is not working for Chinese language #10386
-
Chinese doesn't have a lemmatizer directly in spaCy. The change in lemmas you've seen is probably due to changes in how segmentation works between spaCy v2 and v3; see the Chinese support notes for details. I think if you use jieba you should be able to get lemmas, as their documentation indicates they have functionality for it. That might not work as well with the pretrained pipelines though, which use a special pkuseg model for compatibility with the training data.
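As a sketch of that jieba route, switching the Chinese tokenizer to jieba segmentation in v3 is a config option (this follows the documented `segmenter` setting for `spacy.lang.zh.Chinese`; the `jieba` package must be installed separately):

```python
from spacy.lang.zh import Chinese

# Configure the Chinese tokenizer to use jieba for word segmentation
# instead of the default character segmentation.
cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})

doc = nlp("我喜欢自然语言处理")
print([token.text for token in doc])
```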
Chinese and Japanese both use external tokenizers. In Japanese, SudachiPy provides lemmas along with tokenization. Beyond that, the two languages have little in common in how spaCy handles them or in their lemmatization needs. Japanese has significant inflection for several important word classes. I don't speak Chinese, but I understand inflection is rare, which may explain the articles you read; that wouldn't apply to Japanese.
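You can see this in a Japanese pipeline, where lemmas come straight from the tokenizer (a minimal example, assuming the `ja_core_news_sm` pipeline is installed):

```python
import spacy

# Japanese pipelines tokenize with SudachiPy, which supplies lemmas
# as part of its morphological analysis.
nlp = spacy.load("ja_core_news_sm")

doc = nlp("勉強しています")
for token in doc:
    print(token.text, token.lemma_)  # e.g. し -> する, い -> いる
```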
-
I think the main difference between v2.3 and v3 here is the built-in lemma backoff. In v2.3 you would get `token.text` back as `token.lemma_` for tokens without a lemma, while in v3 the lemma defaults to an empty string. If you want, you can write a simple pipeline component that copies `token.text` to `token.lemma_` for tokens where no lemma is set. As far as I know there is no built-in lemmatizer for Chinese in spaCy.
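A minimal sketch of such a component might look like this (the component name `lemma_backoff` and the `zh_core_web_sm` pipeline are illustrative choices, not part of spaCy itself):

```python
import spacy
from spacy.language import Language

@Language.component("lemma_backoff")
def lemma_backoff(doc):
    # Fall back to the surface form when no lemma is set,
    # mimicking the v2.3 backoff behavior described above.
    for token in doc:
        if not token.lemma_:
            token.lemma_ = token.text
    return doc

nlp = spacy.load("zh_core_web_sm")
nlp.add_pipe("lemma_backoff")

doc = nlp("我喜欢自然语言处理")
print([token.lemma_ for token in doc])
```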
-
How to reproduce the behaviour
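A minimal script along these lines reproduces it (the pipeline name and input text here are assumptions, since the original snippet is not quoted):

```python
import spacy

# Any v3 Chinese pipeline shows the same behavior.
nlp = spacy.load("zh_core_web_sm")

doc = nlp("我喜欢自然语言处理")
for token in doc:
    print(token.lemma_)  # prints empty strings in v3
```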
The above program produces empty output.
I came across some articles saying that lemmatization is not required for Chinese. I would guess Japanese falls into the same category as Chinese, yet lemmatization works properly for Japanese in spaCy.
The same code was working earlier with spaCy version 2.3.2.
Your Environment