How to get per-language tokenizers without loading language models #5886

alexcombessie · 2020-08-06T08:57:13Z

Hi,

First of all, thanks for the great work going on here! I have a tricky question which I didn't manage to properly answer by reading the documentation.

I am working on an application with a tokenization pipeline for multiple languages. The application is meant to be lightweight, and I don't need the tagger/parser/NER features. Hence, I need the application to work without downloading the language models.

I have browsed the documentation, in particular https://spacy.io/api/tokenizer#init and read some of the code.

From my understanding, there are 3 ways to achieve this:

# case 1
nlp = spacy.util.get_lang_class("en")
tokenizer = nlp.Default.create_tokenizer()

# case 2
nlp = spacy.blank("en")
tokenizer = nlp.Default.create_tokenizer()

# case 3
from spacy.lang.en import English
nlp = English()
tokenizer = nlp.Default.create_tokenizer()

Could you please confirm:

which case I should use to avoid loading the language model?
if there is any difference in functionalities between a tokenizer loaded with the language model and without?

Thanks a lot,

Alex

Answered by adrianeboyd

adrianeboyd · 2020-08-06T11:20:11Z

There are no major differences between these options except that you don't need to recreate the default tokenizer in case 2 or case 3. It is already created when you instantiate the pipeline:

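import spacy

# case 1: look up the language class and build its default tokenizer directly
English = spacy.util.get_lang_class("en")
tokenizer = English.Defaults.create_tokenizer()

# case 2: a blank pipeline already comes with a tokenizer
nlp = spacy.blank("en")
tokenizer = nlp.tokenizer

# case 3: instantiating the language class also creates the tokenizer
from spacy.lang.en import English
nlp = English()
tokenizer = nlp.tokenizer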

If I had to pick one, I'd pick option 2 as the most standard / simple way to create a blank pipeline in a way that's easy to extend to multiple languages.
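To illustrate how option 2 might extend to several languages, here is a minimal sketch. The language codes and the helper names `build_tokenizers` and `tokenize` are assumptions made up for this example, not part of spaCy; only `spacy.blank()` and `nlp.tokenizer` come from the answer above.

```python
import spacy

# Language codes chosen purely for illustration.
LANGUAGES = ["en", "de", "fr"]


def build_tokenizers(lang_codes):
    """Map each language code to the tokenizer of a blank pipeline.

    spacy.blank() only needs the language data shipped with spaCy itself,
    so no language model is downloaded or loaded.
    """
    return {code: spacy.blank(code).tokenizer for code in lang_codes}


tokenizers = build_tokenizers(LANGUAGES)


def tokenize(text, lang):
    """Return the token texts for `text` using the tokenizer for `lang`."""
    doc = tokenizers[lang](text)  # calling a tokenizer returns a Doc
    return [token.text for token in doc]


english_text = "Don't load a model."
german_text = "Das ist ein Test."
english_tokens = tokenize(english_text, "en")
german_tokens = tokenize(german_text, "de")
print(english_tokens)
print(german_tokens)
```

Building the tokenizers once and reusing them avoids recreating a blank pipeline for every call.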

The tokenizers in the language models save the spaCy defaults that were current when the models were trained, so you will find minor differences across model versions even for the same language. When you load a language model, all the tokenizer settings are loaded from the model, not from the current spaCy install, so a language model's tokenizer may differ slightly from the defaults of the spaCy release you have installed.

If you want to save particular settings, you can pin your project to a particular spaCy release to always get the same defaults, or you can use nlp.to_disk() to save the current settings as a model that you can reload with spacy.load().
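The save-and-reload route could look roughly like the sketch below. The directory name `blank_en_tokenizer` and the sample sentence are hypothetical, and a temporary directory is used only to keep the example self-contained; in a real project you would write to a stable path.

```python
import tempfile
from pathlib import Path

import spacy

nlp = spacy.blank("en")
text = "Pin the tokenizer settings, don't rely on defaults."

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical directory name, chosen for this example.
    model_dir = Path(tmp_dir) / "blank_en_tokenizer"
    # Save the pipeline, including the current tokenizer settings.
    nlp.to_disk(model_dir)
    # Reload it from disk; the settings come from the saved files.
    reloaded = spacy.load(model_dir)

original_tokens = [token.text for token in nlp(text)]
reloaded_tokens = [token.text for token in reloaded(text)]
print(original_tokens == reloaded_tokens)
```

Within the same spaCy install, the reloaded pipeline should tokenize the same way as the one that was saved.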

alexcombessie · 2020-08-06T16:41:31Z

Thanks @adrianeboyd, that's really helpful and understandable.

If you could add that somewhere in the documentation, I think it would be valuable for others!
