Japanese transformers-based model #9323
hiroshi-matsuda-rit started this conversation in Language Support
After discussions with @polm, @tamuhey, and @KoichiYasuoka in the spaCy Japanese community, we decided to propose publishing a transformers-based Japanese analysis model:

- use the `basic` pretokenizer instead of `mecab` to remove the `fugashi` and `unidic-lite` dependencies
- set these tokenizer options in the `[components.transformer.model.tokenizer_config]` section of the config file (sketched below)

With these settings, the accuracy of the spaCy v3 Japanese model will be greatly improved.
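For concreteness, here is a minimal Python sketch of what such a setup could look like; the same keys would sit under `[components.transformer.model.tokenizer_config]` in a training config. The model name `cl-tohoku/bert-base-japanese-char-v2` and the span-getter values are assumptions for illustration, not part of the proposal; `word_tokenizer_type` and `subword_tokenizer_type` are the `BertJapaneseTokenizer` options that select the `basic` pretokenizer and character-level subwords.

```python
import spacy
import spacy_transformers  # noqa: F401  (registers the "transformer" factory)

# A minimal sketch, not the published model's actual config.
# Model name and span-getter settings are assumptions for illustration.
nlp = spacy.blank("ja")
nlp.add_pipe(
    "transformer",
    config={
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v3",
            "name": "cl-tohoku/bert-base-japanese-char-v2",  # assumed char-based BERT
            "get_spans": {
                "@span_getters": "spacy-transformers.strided_spans.v1",
                "window": 128,
                "stride": 96,
            },
            # Passed through to AutoTokenizer.from_pretrained(). Selecting the
            # "basic" pretokenizer means fugashi and unidic-lite (needed only
            # by the "mecab" pretokenizer) are no longer runtime dependencies.
            "tokenizer_config": {
                "use_fast": False,  # BertJapaneseTokenizer has no fast variant
                "word_tokenizer_type": "basic",
                "subword_tokenizer_type": "character",
            },
        }
    },
)
```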
Although not as accurate as the fully pretokenized BERT model, the character-based BERT model reduces memory consumption and eliminates both the pretokenizer overhead and many-to-many token alignments.
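As a hedged illustration of the alignment point, using Hugging Face's `BertJapaneseTokenizer` directly (the model name is again an assumption): with character-level subwords, every transformer token is a single character, so each spaCy token maps to a contiguous run of characters, and many-to-many alignments cannot arise.

```python
from transformers import BertJapaneseTokenizer

# Sketch: character-level subwords with the "basic" pretokenizer.
# No fugashi/unidic-lite needed; downloading the vocab requires network access.
tokenizer = BertJapaneseTokenizer.from_pretrained(
    "cl-tohoku/bert-base-japanese-char-v2",  # assumed char-based model
    word_tokenizer_type="basic",
    subword_tokenizer_type="character",
)

print(tokenizer.tokenize("日本語の解析モデル"))
# Expected (exact output depends on the vocab/version):
# ['日', '本', '語', 'の', '解', '析', 'モ', 'デ', 'ル']
# Every transformer token is one character, so alignment to spaCy tokens is
# one-to-many at worst, never many-to-many.
```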
https://github.com/megagonlabs/UD_Japanese-GSD/blob/master/leader_board.md
https://docs.google.com/spreadsheets/d/1D1MvywJYCSVCHaL8p9MRjwL7CHmWQFLIH--kqW8DSmQ/edit#gid=743830749
Reply:

Thanks, we'll try it out!