There are no major differences between these options except that you don't need to recreate the default tokenizer in case 2 or case 3. It is already created when you instantiate the pipeline:

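import spacy

# case 1: look up the language class and build the default tokenizer by hand
English = spacy.util.get_lang_class("en")
tokenizer = English.Defaults.create_tokenizer()

# case 2: create a blank pipeline from the language code and reuse its tokenizer
nlp = spacy.blank("en")
tokenizer = nlp.tokenizer

# case 3: import the language class directly and reuse its tokenizer
from spacy.lang.en import English
nlp = English()
tokenizer = nlp.tokenizer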

If I had to pick one, I'd pick case 2: it is the simplest and most standard way to create a blank pipeline, and it extends easily to multiple languages because only the language code changes.
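
As a rough illustration of the comparison above, the sketch below checks that case 2 and case 3 tokenize a sample sentence identically, then reuses the case 2 pattern for a second language. It assumes spaCy is installed; the helper name `blank_tokenizer`, the sample sentence, and the choice of German (`"de"`) are illustrative assumptions, not part of the original answer.

```python
import spacy
from spacy.lang.en import English


def blank_tokenizer(lang_code):
    """Hypothetical helper: default tokenizer of a blank pipeline (case 2)."""
    return spacy.blank(lang_code).tokenizer


sample = "Let's compare the tokenizers."  # illustrative sample text

# case 2 vs. case 3: both reuse the tokenizer created with the pipeline
tokens_case2 = [t.text for t in blank_tokenizer("en")(sample)]
tokens_case3 = [t.text for t in English().tokenizer(sample)]
print(tokens_case2 == tokens_case3)

# Same pattern for another language: only the language code changes
german_doc = blank_tokenizer("de")(sample)
print([t.text for t in german_doc])
```

Case 3 would need a separate import for every language class, which is why the language-code form is easier to loop over.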

The tokenizers in the language models save the current defaults from spacy at the point when they were tr…

Answer selected by ines
Labels: feat / tokenizer (Feature: Tokenizer)