After to_bytes without vocab, and from_bytes, lang_ is None in Doc objects #4390

gthb · 2019-10-07T12:42:13Z

How to reproduce the behaviour

The following was written so that we don't have to wait 15 seconds on each pytest invocation of our test suite, where we need to tokenize strings exactly as spaCy's language model does but don't need the vocab:

import os

from spacy.language import Language


def get_spacy_tokenizer(lang: str) -> Language:
    """Returns just the tokenizer part extracted from the given spaCy language model, cached on disk.

    Loads faster, so tests start up faster.
    """
    tokenizer_filename = f"{lang}_tokenizer.bytes"
    if not os.path.exists(tokenizer_filename):
        # get_language_model is our own helper, described below
        lang_model = get_language_model(lang)
        with open(tokenizer_filename, 'wb') as tokenizer_file:
            tokenizer_file.write(lang_model.to_bytes(exclude=['vocab']))
    with open(tokenizer_filename, 'rb') as tokenizer_file:
        lang_model = Language().from_bytes(tokenizer_file.read())
    # Attempted workaround: this does not change doc.lang_
    lang_model.meta["lang"] = lang
    return lang_model

(where get_language_model just looks up the model name we've configured for the language, and calls spacy.load)

That works great and speeds up test startup a lot ... but it turns out that the Doc objects produced by this language model have doc.lang_ set to None, even though the meta attribute is present on the language model (I'm setting it there in an unsuccessful attempt to work around this).

This is because doc.lang_ is a property proxying to doc.vocab.lang and of course doc.vocab is not loaded.

But the language is a property of the language model itself, not just of its vocabulary (though of course they ought to match). So I don't think doc.lang_ should break just because the vocabulary isn't loaded.
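
The proxying described above can be illustrated with a toy model. This is a minimal sketch using hypothetical classes (ToyVocab, ToyDoc, ToyLanguage), not spaCy's actual implementation; it only mirrors the relationship where the doc asks its vocab for the language and never looks at meta.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyVocab:
    # A blank vocab has no language assigned.
    lang: Optional[str] = None


@dataclass
class ToyDoc:
    vocab: ToyVocab

    @property
    def lang_(self) -> Optional[str]:
        # The doc stores no language of its own; it proxies to its vocab.
        return self.vocab.lang


@dataclass
class ToyLanguage:
    vocab: ToyVocab = field(default_factory=ToyVocab)
    meta: Dict[str, str] = field(default_factory=dict)

    def __call__(self, text: str) -> ToyDoc:
        # Every doc shares the pipeline's vocab.
        return ToyDoc(vocab=self.vocab)


# Setting meta["lang"] after the fact does not reach the vocab.
nlp = ToyLanguage()
nlp.meta["lang"] = "en"
doc = nlp("some text")
print(doc.lang_)  # None

# A vocab created with a language makes the proxy work.
nlp_en = ToyLanguage(vocab=ToyVocab(lang="en"))
print(nlp_en("some text").lang_)  # en
```

In this toy version, the only way to change what lang_ reports is to give the vocab a language, which is why poking a value into meta has no effect.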

Info about spaCy

spaCy version: 2.1.6
Platform: Darwin-18.7.0-x86_64-i386-64bit
Python version: 3.7.4

Answered by ines

ines (Maintainer) · 2019-10-07T14:57:25Z

> This is because doc.lang_ is a property proxying to doc.vocab.lang and of course doc.vocab is not loaded.

That's correct, yes. In your code, you'll still have a Vocab btw – the Language class initializes this automatically. It's just that your Vocab is blank and doesn't have a language assigned.

The meta["lang"] setting exists so that you can create an instance of the same Language subclass – e.g. via util.get_lang_class(meta["lang"]). This is also how spaCy does it under the hood when you load a model.

I'm assuming you don't want to load the vocab because of the word vectors? The following shouldn't be slower than what you currently have:

lang_cls = spacy.util.get_lang_class(lang)
lang_model = lang_cls().from_bytes(tokenizer_file.read())

You might also want to choose a different name and not call it tokenizer – strictly speaking, you're serializing the whole model and pipeline here, just without the vocab, not nlp.tokenizer.
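
Putting the suggestion together, a fuller sketch might look like the following. It assumes spaCy is installed and that the API behaves as discussed in this thread. The function name, the cache file name, and the use of spacy.blank as a stand-in for the loaded model are assumptions made to keep the example self-contained; only get_lang_class, to_bytes(exclude=["vocab"]) and from_bytes come from the discussion above.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.language import Language


def load_pipeline_without_vocab(lang: str, cache_dir: Path) -> Language:
    """Restore a pipeline that was serialized without its vocab."""
    # Hypothetical file name; it avoids the word "tokenizer" because the
    # bytes hold the serialized pipeline minus the vocab, not nlp.tokenizer.
    cache_file = cache_dir / f"{lang}_pipeline_no_vocab.bytes"
    if not cache_file.exists():
        # Assumption: a blank pipeline stands in for the model that the
        # original post loads through its own get_language_model helper.
        source = spacy.blank(lang)
        cache_file.write_bytes(source.to_bytes(exclude=["vocab"]))
    # Instantiate the language-specific subclass rather than Language(),
    # so the automatically created Vocab has the language assigned.
    lang_cls = spacy.util.get_lang_class(lang)
    return lang_cls().from_bytes(cache_file.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    nlp = load_pipeline_without_vocab("en", Path(tmp))
    doc = nlp("Tokenize this, please.")
    print(doc.lang_)
```

Because the Vocab is created by the language subclass, doc.lang_ should report the language code without any change to meta.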

svlandeg · 2020-10-16T14:03:21Z

I think this has been addressed by Ines' explanations and suggestions? If not - feel free to open a new issue!
