POS-tag a list of Japanese tokens that have already been tokenized #9983

BLKSerene · 2022-01-05T03:05:07Z

Hi, I need to POS-tag a list of Japanese tokens that have already been tokenized. The following code snippet works for most languages that have trained models.

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> tokens = ['This', 'is', 'a', 'sentence', '.']
>>> doc = spacy.tokens.Doc(nlp.vocab, words = tokens, spaces = [False] * len(tokens))
>>> for pipe_name in nlp.pipe_names:
	nlp.get_pipe(pipe_name)(doc)
>>> for token in doc:
	print(token.text, token.tag_, token.pos_)

This DT PRON
is VBZ AUX
a DT DET
sentence NN NOUN
. . PUNCT

However, this does not work for the Japanese model (the fine-grained POS tags are missing).

>>> nlp = spacy.load('ja_core_news_sm')
>>> tokens = ['日本', '語', '（', 'にほん', 'ご', '、', 'にっぽん', 'ご', '[', '注', '2', ']', '、', '英', ':', 'Japanese', '）', 'は', '、', '日本', '国', '内', 'や', '、', 'かつて', 'の', '日本', '領', 'だっ', 'た', '国', '、', 'そして', '日本', '人', '同士', 'の', '間', 'で', '使用', 'さ', 'れ', 'て', 'いる', '言語', '。']
>>> doc = spacy.tokens.Doc(nlp.vocab, words = tokens, spaces = [False] * len(tokens))
>>> for pipe_name in nlp.pipe_names:
	nlp.get_pipe(pipe_name)(doc)
>>> for token in doc:
	print(token.text, token.tag_, token.pos_)

日本  PROPN
語  NOUN
（  NOUN
にほん  ADJ
ご  NOUN
、  PUNCT
にっぽん  PROPN
ご  NOUN
[  NOUN
注  NOUN
2  NUM
]  PUNCT
、  PUNCT
英  NOUN
:  SYM
Japanese  PROPN
）  PROPN
は  ADP
、  PUNCT
日本  PROPN
国  NOUN
内  NOUN
や  ADP
、  PUNCT
かつて  ADJ
の  ADP
日本  PROPN
領  NOUN
だっ  AUX
た  AUX
国  NOUN
、  PUNCT
そして  CCONJ
日本  PROPN
人  NOUN
同士  NOUN
の  ADP
間  NOUN
で  ADP
使用  NOUN
さ  PART
れ  NOUN
て  SCONJ
いる  VERB
言語  NOUN
。  PUNCT

spaCy version: 3.2.1

Answered by polm

polm · 2022-01-05T06:09:14Z

> However, this does not work for the Japanese model (the fine-grained POS tags are missing).

Japanese fine-grained part-of-speech tags are taken directly from SudachiPy output, so spaCy has no trained model that predicts them. You can train a Tagger component to provide the tags; I know that's been done before with pretty good results.

For most Japanese tokenizers, tokenization is done jointly with (pseudo-)POS tag assignment, so I would expect your source of tokens to also give you tags. What tokenizer are you using?

3 replies

BLKSerene Jan 5, 2022
Author

In my project, there are cases where I need to assign POS tags to a user-provided corpus that may already have been tokenized by some tokenizer (for Japanese, SudachiPy or nagisa, for example) but not POS-tagged.
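
For a corpus that was tokenized elsewhere but never tagged, one possible workaround is to re-tokenize the joined text with a tokenizer that does emit tags, then project those tags back onto the original tokens by character offset. The sketch below only illustrates that idea. `project_tags` is a hypothetical helper, not part of spaCy or SudachiPy, and it assumes both tokenizations concatenate to exactly the same string (no inserted whitespace, no normalization).

```python
from typing import List, Sequence, Tuple


def project_tags(
    user_tokens: Sequence[str],
    tagged: Sequence[Tuple[str, str]],
) -> List[Tuple[str, List[str]]]:
    """Map tags from one tokenization onto another over the same text.

    `tagged` holds (surface, tag) pairs from a tokenizer that emits tags.
    Returns (user_token, tags) pairs, where `tags` lists the tag of every
    tagger token that overlaps the user token, in text order.
    """
    # Assumption: both tokenizations spell out the same string.
    if "".join(user_tokens) != "".join(surface for surface, _ in tagged):
        raise ValueError("both tokenizations must spell out the same text")

    # owner[i] is the index of the tagger token that covers character i.
    owner: List[int] = []
    for index, (surface, _) in enumerate(tagged):
        owner.extend([index] * len(surface))

    result = []
    start = 0
    for token in user_tokens:
        end = start + len(token)
        # dict.fromkeys drops repeated indices while keeping text order.
        indices = dict.fromkeys(owner[start:end])
        result.append((token, [tagged[i][1] for i in indices]))
        start = end
    return result
```

When the two tokenizations agree, each user token gets exactly one tag. When they disagree, a user token may receive several tags, or share one tag with its neighbours, and the caller has to decide how to resolve that.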

BLKSerene Jan 7, 2022
Author

And it seems that the Japanese model also does not assign lemmas, am I right?

polm Jan 7, 2022

Lemmas are also provided by SudachiPy, so there is no native spaCy function for that.
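
Since both the fine-grained tags and the lemmas come from SudachiPy, one option is to read them from SudachiPy directly. What follows is a minimal sketch, not spaCy's own implementation. `morpheme_features` and `analyze_with_sudachi` are hypothetical helper names, the tag format (first four part-of-speech fields joined with "-", skipping "*") is an assumption meant to resemble the strings spaCy shows in `token.tag_`, and the choice of `SplitMode.A` is also an assumption. It needs the `sudachipy` and `sudachidict_core` packages only when `analyze_with_sudachi` is actually called.

```python
from typing import Iterable, List, Tuple


def morpheme_features(morphemes: Iterable) -> List[Tuple[str, str, str]]:
    """Return (surface, tag, lemma) for each SudachiPy-style morpheme.

    Each morpheme must provide surface(), part_of_speech() and
    dictionary_form(), as SudachiPy's Morpheme objects do.
    """
    rows = []
    for m in morphemes:
        # Assumed tag format: first four POS fields, "*" placeholders dropped.
        fields = [f for f in m.part_of_speech()[:4] if f != "*"]
        rows.append((m.surface(), "-".join(fields), m.dictionary_form()))
    return rows


def analyze_with_sudachi(text: str) -> List[Tuple[str, str, str]]:
    """Tokenize raw text with SudachiPy and extract tags and lemmas."""
    # Imported lazily so morpheme_features() works without SudachiPy.
    from sudachipy import dictionary, tokenizer

    sudachi = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.A  # short units; an assumption
    return morpheme_features(sudachi.tokenize(text, mode))
```

Calling `analyze_with_sudachi("".join(tokens))` re-tokenizes the text, so the resulting tokens may not line up with a list that was tokenized by a different tool.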
