Losing POS Tagging & Other Token Attributes when Segmenting with Jieba or Pkuseg #12846
jonathanknebel started this conversation in Language Support (1 comment · 1 reply)
Comment:

Hi @creolio, in the first example (with the default segmenter) you're using the […] In your other two snippets, using […]
Original post:
I'm trying to ensure that I have accurate word segmentation/tokenization for Chinese while retaining access to token attributes such as part of speech, but it seems that when I switch segmenters from the default, I lose most of the token attribute data. I'm not training any custom models or anything like that.
My base Jupyter notebook code looks like this:
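Something like this (a minimal sketch, since the actual cell isn't shown on this page; the example sentence is arbitrary):

```python
import spacy

# Trained Chinese pipeline (includes tok2vec, tagger, parser, ner, ...)
nlp = spacy.load("zh_core_web_sm")

doc = nlp("我喜欢在图书馆看书。")  # arbitrary example sentence

# Print each token with its part-of-speech and dependency attributes
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_)
```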
With the above, I'm able to get both segmentation and token attributes, but I'm confused because I thought the default segmenter was "char". I'm using this as my solution for now, but I'd like to be able to play with other segmenters.

When I change:
nlp = spacy.load("zh_core_web_sm")
To:
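(The exact replacement snippet isn't reproduced here; presumably a blank Chinese pipeline configured for another segmenter, e.g. jieba as in the spaCy usage docs. A sketch, with the segmenter choice assumed from the title:)

```python
import spacy

# Blank Chinese pipeline configured to use the jieba word segmenter
# (requires the jieba package). Note that a blank pipeline has no tagger
# or parser, so token.pos_, token.tag_ and token.dep_ stay empty.
cfg = {"nlp": {"tokenizer": {"segmenter": "jieba"}}}
nlp = spacy.blank("zh", config=cfg)

doc = nlp("我喜欢在图书馆看书。")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```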
I get output where the text is segmented, but most of the token attributes (pos_, tag_, dep_) come back empty.

Or if I use the following instead:
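(Again, the snippet itself isn't reproduced here; presumably the pkuseg variant from the spaCy usage docs, roughly like this sketch, which needs the spacy-pkuseg package:)

```python
from spacy.lang.zh import Chinese

# Blank Chinese pipeline using the pkuseg word segmenter
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})

# Load pkuseg's default "mixed" model
nlp.tokenizer.initialize(pkuseg_model="mixed")

doc = nlp("我喜欢在图书馆看书。")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```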
Or:
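(Or the same setup with a different pkuseg model, e.g. the OntoNotes model shipped with spacy-pkuseg; again only a sketch, and the model name here is an assumption:)

```python
from spacy.lang.zh import Chinese

# Same pkuseg configuration, but loading the spaCy OntoNotes pkuseg model
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes")
```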
I get the same kind of output: segmentation works, but the part-of-speech and dependency attributes are empty.

Using:
Python 3.10
spaCy 3.6