Japanese transformers-based model #9323
hiroshi-matsuda-rit started this conversation in Language Support
After discussions with @polm, @tamuhey, and @KoichiYasuoka in the spaCy Japanese community, we decided to propose publishing a transformers-based Japanese analysis model:

- use the `basic` pretokenizer instead of `mecab` to remove the `fugashi` and `unidic-lite` dependencies
- set these tokenizer options in the `[components.transformer.model.tokenizer_config]` section of the config file (sketched below)

With these settings, the accuracy of the spaCy v3 Japanese model will be greatly improved.
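For concreteness, here is a minimal Python sketch of what such a setup could look like; the same keys would sit under `[components.transformer.model.tokenizer_config]` in a training config. The model name `cl-tohoku/bert-base-japanese-char-v2` and the span-getter values are assumptions for illustration, not part of the proposal; `word_tokenizer_type` and `subword_tokenizer_type` are the `BertJapaneseTokenizer` options that select the `basic` pretokenizer and character-level subwords.

```python
import spacy
import spacy_transformers  # noqa: F401  (registers the "transformer" factory)

# A minimal sketch, not the published model's actual config.
# Model name and span-getter settings are assumptions for illustration.
nlp = spacy.blank("ja")
nlp.add_pipe(
    "transformer",
    config={
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v3",
            "name": "cl-tohoku/bert-base-japanese-char-v2",  # assumed char-based BERT
            "get_spans": {
                "@span_getters": "spacy-transformers.strided_spans.v1",
                "window": 128,
                "stride": 96,
            },
            # Passed through to AutoTokenizer.from_pretrained(). Selecting the
            # "basic" pretokenizer means fugashi and unidic-lite (needed only
            # by the "mecab" pretokenizer) are no longer runtime dependencies.
            "tokenizer_config": {
                "use_fast": False,  # BertJapaneseTokenizer has no fast variant
                "word_tokenizer_type": "basic",
                "subword_tokenizer_type": "character",
            },
        }
    },
)
```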
Although not as accurate as the fully pretokenized BERT model, the character-based BERT model reduces memory consumption and eliminates both the pretokenizer overhead and many-to-many token alignments.
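As a hedged illustration of the alignment point, using Hugging Face's `BertJapaneseTokenizer` directly (the model name is again an assumption): with character-level subwords, every transformer token is a single character, so each spaCy token maps to a contiguous run of characters, and many-to-many alignments cannot arise.

```python
from transformers import BertJapaneseTokenizer

# Sketch: character-level subwords with the "basic" pretokenizer.
# No fugashi/unidic-lite needed; downloading the vocab requires network access.
tokenizer = BertJapaneseTokenizer.from_pretrained(
    "cl-tohoku/bert-base-japanese-char-v2",  # assumed char-based model
    word_tokenizer_type="basic",
    subword_tokenizer_type="character",
)

print(tokenizer.tokenize("日本語の解析モデル"))
# Expected (exact output depends on the vocab/version):
# ['日', '本', '語', 'の', '解', '析', 'モ', 'デ', 'ル']
# Every transformer token is one character, so alignment to spaCy tokens is
# one-to-many at worst, never many-to-many.
```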
https://github.com/megagonlabs/UD_Japanese-GSD/blob/master/leader_board.md
https://docs.google.com/spreadsheets/d/1D1MvywJYCSVCHaL8p9MRjwL7CHmWQFLIH--kqW8DSmQ/edit#gid=743830749
Reply:

Thanks, we'll try it out!