The Vocab for the tokenizer is a spaCy-specific object, not the kind of tokenizer vocab you're thinking of.

If you want to split on characters, you have two options:

  • Create Docs manually with Doc (see the runnable sketch after this list):
    from spacy.tokens import Doc
    doc = Doc(nlp.vocab, words=["T", "h", "e", " ", ...], spaces=[False, False, False, False, ...])
  • Use a custom tokenizer. If you install spacy-experimental==0.6.1, you can use spacy-experimental.char_pretokenizer.v1 in your config or by overriding the default config (usage sketch after this list):
    nlp = spacy.blank("en", config={"nlp": {"tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}}})
    It's a simple custom tokenizer that does the manual creation described above: https://g…
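
Here's a minimal runnable sketch of the first option; the sample text "The cat" is just an illustration, not anything from your data:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
text = "The cat"

# One token per character; whitespace characters become their own tokens,
# so every trailing-space flag is False and doc.text round-trips exactly.
words = list(text)
spaces = [False] * len(words)

doc = Doc(nlp.vocab, words=words, spaces=spaces)
assert doc.text == text
print([t.text for t in doc])  # ['T', 'h', 'e', ' ', 'c', 'a', 't']
```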
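
And a usage sketch of the second option, assuming spacy-experimental==0.6.1 is installed as described in the second bullet:

```python
import spacy

# Requires: pip install spacy-experimental==0.6.1
nlp = spacy.blank(
    "en",
    config={"nlp": {"tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}}},
)

doc = nlp("The cat")
# One token per character, the same result as the manual Doc construction above.
print([t.text for t in doc])  # ['T', 'h', 'e', ' ', 'c', 'a', 't']
```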
