How to do Byte tokenization with Spacy? #11976
I want to change the tokenizer to byte-level tokenization. However, this does not change the tokenization.
Replies: 2 comments 1 reply
Thanks! There is only one problem. When using entity visualization, everything is fine; however, when switching to 'span' mode, the text looks like this. Is this expected behavior, or does span mode not support the char tokenizer? It might look fine here, but when the text becomes long enough it is no longer readable.

The `Vocab` for the tokenizer is a spaCy-specific object, not the kind of tokenizer vocab you're thinking of.

If you want to split on characters, you have two options:

1. … `Doc`:
2. With `spacy-experimental==0.6.1`, you can use `spacy-experimental.char_pretokenizer.v1` in your config or by overriding the default config.
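
For readers who want to see what character-level splitting can look like in code, here is a minimal sketch. It is not necessarily the code from the reply above: `char_words_and_spaces` and `CharTokenizer` are hypothetical names introduced only for illustration. The sketch assumes that every character, including each space, should become its own token, and it relies on spaCy's documented `Doc(vocab, words=..., spaces=...)` constructor.

```python
from typing import List, Tuple


def char_words_and_spaces(text: str) -> Tuple[List[str], List[bool]]:
    """Return the `words` and `spaces` lists for a one-token-per-character Doc.

    Assumption: space characters become tokens of their own, so no token
    is marked as being followed by whitespace.
    """
    words = list(text)
    spaces = [False] * len(words)
    return words, spaces


class CharTokenizer:
    """Hypothetical custom tokenizer: a callable that maps a string to a Doc."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text: str):
        # Imported lazily so the helper above stays usable without spaCy.
        from spacy.tokens import Doc

        words, spaces = char_words_and_spaces(text)
        return Doc(self.vocab, words=words, spaces=spaces)
```

To try it, create a pipeline such as `nlp = spacy.blank("en")`, assign `nlp.tokenizer = CharTokenizer(nlp.vocab)`, and call `nlp(...)` on some text. Because all `spaces` flags are `False` and the spaces are tokens themselves, `doc.text` should round-trip the input string unchanged.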