How to get per-language tokenizers without loading language models #5886
Hi,

First of all, thanks for the great work going on here! I have a tricky question that I couldn't properly answer by reading the documentation.

I am working on an application with a tokenization pipeline for multiple languages. The application is meant to be lightweight, and I don't need the tagger/parser/ner features. Hence, I need the application to work without downloading the language models.

I have browsed the documentation, in particular https://spacy.io/api/tokenizer#init, and read some of the code. From my understanding, there are 3 ways to achieve this: Could you please confirm:

Thanks a lot,
Alex
Replies: 2 comments
There are no major differences between these options, except that you don't need to recreate the default tokenizer in case 2 or case 3: it is already created when you instantiate the pipeline.

```python
import spacy

# case 1: build the tokenizer directly from the language defaults
# (Defaults.create_tokenizer is the spaCy v2.x API)
English = spacy.util.get_lang_class("en")
tokenizer = English.Defaults.create_tokenizer()

# case 2: create a blank pipeline from the language code
nlp = spacy.blank("en")
tokenizer = nlp.tokenizer

# case 3: instantiate the language class directly
from spacy.lang.en import English
nlp = English()
tokenizer = nlp.tokenizer
```

If I had to pick one, I'd pick option 2 as the most standard and simple way to create a blank pipeline, and one that's easy to extend to multiple languages.

The tokenizers in the language models save the spaCy defaults as they were at the point when the models were trained, so you will find minor differences across model versions even for the same language. When you load a language model, all the tokenizer settings are loaded from the model, not from the current spaCy install, so you may see minor differences between a language model and a particular spaCy release.

If you want to save particular settings, you can pin your project to a particular spaCy release to always get the same defaults or you can use …
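As a minimal sketch of how option 2 might extend to several languages: the helper name `build_tokenizers` and the language list below are made up for illustration, and the sketch assumes the chosen languages need no extra tokenizer dependencies (some languages, such as Japanese or Chinese, may require additional packages).

```python
import spacy


def build_tokenizers(lang_codes):
    """Return a dict mapping each language code to a default tokenizer.

    Hypothetical helper, not part of spaCy.
    """
    # spacy.blank() creates a pipeline that only contains the language's
    # default tokenizer; no trained model package is downloaded or loaded.
    return {code: spacy.blank(code).tokenizer for code in lang_codes}


# Illustrative language list; replace with the languages you need.
tokenizers = build_tokenizers(["en", "de", "fr"])

doc = tokenizers["en"]("Hello world!")
print([token.text for token in doc])
```

Building the tokenizers once and reusing them keeps the per-text cost low, since each call to `spacy.blank` sets up a new pipeline object.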
Thanks @adrianeboyd, that's really helpful and understandable. If you could add that somewhere in the documentation, I think that'd be valuable for others!