Losing POS Tagging & Other Token Attributes when Segmenting with Jieba or Pkuseg #12846
jonathanknebel started this conversation in Language Support (1 comment · 1 reply)
Comment:

Hi @creolio, in the first example (with the default segmenter) you're using the […] In your other two snippets, using […]
Original post:
I'm trying to ensure that I have accurate word segmentation/tokenization for Chinese while retaining access to token attributes such as part of speech, but it seems that when I switch segmenters from the default, I lose most of the token attribute data. I'm not training any custom models or anything like that.
My base Jupyter notebook code looks like this:
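Something like this (a minimal sketch, since the actual cell isn't shown on this page; the example sentence is arbitrary):

```python
import spacy

# Trained Chinese pipeline (includes tok2vec, tagger, parser, ner, ...)
nlp = spacy.load("zh_core_web_sm")

doc = nlp("我喜欢在图书馆看书。")  # arbitrary example sentence

# Print each token with its part-of-speech and dependency attributes
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_)
```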
With the above, I'm able to get both segmentation and token attributes, but I'm confused because I thought the default segmenter was "char". I'm using this as my solution for now, but I'd like to be able to play with other segmenters.

When I change:
nlp = spacy.load("zh_core_web_sm")
To:
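(The exact replacement snippet isn't reproduced here; presumably a blank Chinese pipeline configured for another segmenter, e.g. jieba as in the spaCy usage docs. A sketch, with the segmenter choice assumed from the title:)

```python
import spacy

# Blank Chinese pipeline configured to use the jieba word segmenter
# (requires the jieba package). Note that a blank pipeline has no tagger
# or parser, so token.pos_, token.tag_ and token.dep_ stay empty.
cfg = {"nlp": {"tokenizer": {"segmenter": "jieba"}}}
nlp = spacy.blank("zh", config=cfg)

doc = nlp("我喜欢在图书馆看书。")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```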
I get output where the text is segmented, but most of the token attributes (pos_, tag_, dep_) come back empty.

Or if I use the following instead:
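(Again, the snippet itself isn't reproduced here; presumably the pkuseg variant from the spaCy usage docs, roughly like this sketch, which needs the spacy-pkuseg package:)

```python
from spacy.lang.zh import Chinese

# Blank Chinese pipeline using the pkuseg word segmenter
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})

# Load pkuseg's default "mixed" model
nlp.tokenizer.initialize(pkuseg_model="mixed")

doc = nlp("我喜欢在图书馆看书。")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```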
Or:
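(Or the same setup with a different pkuseg model, e.g. the OntoNotes model shipped with spacy-pkuseg; again only a sketch, and the model name here is an assumption:)

```python
from spacy.lang.zh import Chinese

# Same pkuseg configuration, but loading the spaCy OntoNotes pkuseg model
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes")
```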
I get the same kind of output: segmentation works, but the part-of-speech and dependency attributes are empty.

Using:
Python 3.10
spaCy 3.6