The Vocab for the tokenizer is a spaCy-specific object, not the kind of tokenizer vocab you're thinking of.

If you want to split on characters, you have two options:

  • Create Docs manually with Doc (see the runnable sketch after this list):
    from spacy.tokens import Doc
    doc = Doc(nlp.vocab, words=["T", "h", "e", " ", ...], spaces=[False, False, False, False, ...])
  • Use a custom tokenizer. If you install spacy-experimental==0.6.1, you can use spacy-experimental.char_pretokenizer.v1 in your config or by overriding the default config (usage sketch after this list):
    nlp = spacy.blank("en", config={"nlp": {"tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}}})
    It's a simple custom tokenizer that does the manual creation described above: https://g…
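
Here's a minimal runnable sketch of the first option; the sample text "The cat" is just an illustration, not anything from your data:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
text = "The cat"

# One token per character; whitespace characters become their own tokens,
# so every trailing-space flag is False and doc.text round-trips exactly.
words = list(text)
spaces = [False] * len(words)

doc = Doc(nlp.vocab, words=words, spaces=spaces)
assert doc.text == text
print([t.text for t in doc])  # ['T', 'h', 'e', ' ', 'c', 'a', 't']
```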
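
And a usage sketch of the second option, assuming spacy-experimental==0.6.1 is installed as described in the second bullet:

```python
import spacy

# Requires: pip install spacy-experimental==0.6.1
nlp = spacy.blank(
    "en",
    config={"nlp": {"tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}}},
)

doc = nlp("The cat")
# One token per character, the same result as the manual Doc construction above.
print([t.text for t in doc])  # ['T', 'h', 'e', ' ', 'c', 'a', 't']
```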
