POS-tag a list of Japanese tokens that have already been tokenized #9983
Hi, I need to POS-tag a list of Japanese tokens that have already been tokenized. The following code snippet works for most languages that have models.
However, this does not work for the Japanese model (the fine-grained POS tags are missing).
spaCy version: 3.2.1
Replies: 1 comment 3 replies
Japanese fine-grained part of speech tags are taken directly from SudachiPy output, so spaCy has no model for that. You can train a Tagger component to provide the tags; I know that's been done before with pretty good results.
For most Japanese tokenizers, tokenization is done jointly with (pseudo-)POS tag assignment, so I would expect your source of tokens to also give you tags. What tokenizer are you using?
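If your tokenizer does hand back POS information, one option is to attach it to the `Doc` yourself rather than expecting a pipeline component to predict it. Below is a rough sketch, not an official recipe. It assumes each token arrives as a `(surface, pos_fields)` pair with SudachiPy-style six-field tuples that use `*` for empty slots. The helper names `join_pos_fields` and `build_tagged_doc`, the sample data, and the choice to join the first four non-empty fields with `-` are all assumptions for illustration, so adjust them to whatever your tokenizer actually returns.

```python
from typing import List, Sequence, Tuple

# Hypothetical input format: (surface, POS fields) pairs from an external
# tokenizer. The six-field tuples with "*" placeholders mimic SudachiPy-style
# output; this shape is an assumption, not something spaCy requires.
PosFields = Sequence[str]


def join_pos_fields(fields: PosFields, max_fields: int = 4) -> str:
    """Collapse hierarchical POS fields into one tag string, skipping "*"."""
    return "-".join(f for f in fields[:max_fields] if f != "*")


def build_tagged_doc(tokens: List[Tuple[str, PosFields]]):
    """Create a spaCy Doc whose token.tag_ values come from the tokenizer."""
    # Imported lazily so join_pos_fields stays usable without spaCy installed.
    from spacy.tokens import Doc
    from spacy.vocab import Vocab

    words = [surface for surface, _ in tokens]
    tags = [join_pos_fields(fields) for _, fields in tokens]
    # Japanese text has no whitespace between tokens.
    spaces = [False] * len(words)
    return Doc(Vocab(), words=words, spaces=spaces, tags=tags)


# Illustrative sample values only; real tokenizer output may differ.
sample = [
    ("日本語", ("名詞", "普通名詞", "一般", "*", "*", "*")),
    ("を", ("助詞", "格助詞", "*", "*", "*", "*")),
    ("話す", ("動詞", "一般", "*", "*", "五段-サ行", "終止形-一般")),
]
tags = [join_pos_fields(fields) for _, fields in sample]
```

With spaCy installed, `build_tagged_doc(sample)` should give a `Doc` where each `token.tag_` holds the joined string. If the `Doc` is going to pass through other pipeline components afterwards, you would probably want to pass `nlp.vocab` instead of a fresh `Vocab()`.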