Is it possible to update or replace the language data within an instantiated language pipeline? #11254

afparsons · 2022-08-02T13:15:26Z

Context

After following the documentation, I have the following:

# lex_attrs.py

from spacy.attrs import LIKE_NUM


def like_num(text: str) -> bool:
    """
    Broadens definition of LIKE_NUM to include long scale (milliard, billiard, trilliard) and other numerics.
    """
    ...

LEX_ATTRS = {LIKE_NUM: like_num}

# custom_english.py

from typing import Final

import spacy
from spacy.lang.en import English

from {blah_blah_blah}.lex_attrs import LEX_ATTRS

_LANGUAGE_NAME: Final[str] = "en_custom"


class DefaultsCustomEnglish(English.Defaults):
    lex_attr_getters = LEX_ATTRS


@spacy.registry.languages(_LANGUAGE_NAME)
class CustomEnglish(English):
    lang = _LANGUAGE_NAME
    Defaults = DefaultsCustomEnglish

I have created a config.cfg for this custom Language, and that seems to work ✔️

Question

However, I was wondering if I could simply swap the language data of an existing pipeline, like so (shown with IPython):

In [1]: import spacy
In [2]: from {blah_blah_blah}.custom_english import DefaultsCustomEnglish

# below, we see that LIKE_NUM for "milliard" is False
In [3]: nlp = spacy.load("en_core_web_lg")   # or _sm or _md, doesn't matter
In [4]: doc = nlp("I have one milliard dollars.")
In [5]: analyze_tokens(doc, "like_num")
Out[5]: 
              0      1     2         3        4      5
text          I   have   one  milliard  dollars      .
like_num  False  False  True     False    False  False

# I switch the language data
In [6]: nlp.Defaults = DefaultsCustomEnglish
In [7]: nlp.Defaults.lex_attr_getters
Out[7]: {10: <function{blah_blah_blah}.lex_attrs.like_num(text: str) -> bool>}

# here we see that the function does indeed return True for "milliard"
In [8]: nlp.Defaults.lex_attr_getters[10]("milliard")
Out[8]: True

# however, the new language data does not take effect for the pipeline
In [9]: doc = nlp("I have one milliard dollars.")
In [10]: analyze_tokens(doc, "like_num")
Out[10]: 
              0      1     2         3        4      5
text          I   have   one  milliard  dollars      .
like_num  False  False  True     False    False  False

TL;DR: why can't I swap language data on the fly? Is there a way to do so? Or must I rely on a saved-to-disk pipeline?

analyze_tokens is a custom function found at this gist.

Answered by adrianeboyd

adrianeboyd · 2022-08-02T14:50:27Z

Some of the defaults are only read in right when the pipeline is loaded and there's a lexeme cache in nlp.vocab, so the lexical attributes are only calculated the first time a token is seen. There's no way to clear the lexeme cache other than reloading the pipeline.

In the example above I think you're seeing the lexeme cache, so it would work if you hadn't processed a text containing "milliard" before modifying the function.

Instead of modifying nlp.Defaults, it's generally recommended to modify EnglishDefaults before loading an en pipeline. (Note that this affects any English pipeline loaded within the same script.)
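To make the caching behaviour described above concrete, here is a minimal pure-Python sketch. It is a toy model, not spaCy's actual implementation: `ToyDefaults`, `ToyVocab` and `ToyPipeline` are hypothetical names, and the assumption that the vocab takes a snapshot of the getters at construction time is mine, not something stated in the answer.

```python
from typing import Callable, Dict, List

Getter = Callable[[str], bool]


def default_like_num(text: str) -> bool:
    return text.lower() in {"one", "two", "three"}


def custom_like_num(text: str) -> bool:
    long_scale = {"milliard", "billiard", "trilliard"}
    return default_like_num(text) or text.lower() in long_scale


class ToyDefaults:
    """Hypothetical stand-in for a language's Defaults class."""
    lex_attr_getters: Dict[str, Getter] = {"like_num": default_like_num}


class CustomDefaults(ToyDefaults):
    lex_attr_getters = {"like_num": custom_like_num}


class ToyVocab:
    def __init__(self, getters: Dict[str, Getter]) -> None:
        # Assumption: the getters are copied once, when the vocab is built.
        self.getters = dict(getters)
        self._lexemes: Dict[str, Dict[str, bool]] = {}

    def lexeme(self, word: str) -> Dict[str, bool]:
        # Attributes are computed only the first time a word is seen.
        if word not in self._lexemes:
            self._lexemes[word] = {
                name: getter(word) for name, getter in self.getters.items()
            }
        return self._lexemes[word]


class ToyPipeline:
    Defaults = ToyDefaults

    def __init__(self) -> None:
        # The defaults are read here, at "load" time, and never again.
        self.vocab = ToyVocab(self.Defaults.lex_attr_getters)

    def like_num(self, text: str) -> List[bool]:
        return [self.vocab.lexeme(word)["like_num"] for word in text.split()]


nlp = ToyPipeline()
before = nlp.like_num("one milliard")

# Swapping Defaults on the instance changes neither the cache nor the getters.
nlp.Defaults = CustomDefaults
after_swap = nlp.like_num("one milliard")
after_swap_unseen = nlp.like_num("billiard")

# Changing the defaults *before* building a new pipeline does take effect.
ToyDefaults.lex_attr_getters = {"like_num": custom_like_num}
reloaded = ToyPipeline()
after_reload = reloaded.like_num("one milliard")
```

In this toy model the swap fails for two independent reasons: "milliard" is already cached, and the vocab never looks at `Defaults` again after construction. Only the freshly built pipeline picks up the new getter.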

2 replies

afparsons Aug 3, 2022
Author

so it would work if you hadn't processed a text containing "milliard" before modifying the function.

❌ Hmm, perhaps I am doing something wrong, but that did not work (I got the same result as in my original post).
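One possible explanation, which the thread does not confirm: the vocab keeps its own `lex_attr_getters` dictionary, built when the pipeline is created, so reassigning `nlp.Defaults` afterwards never reaches it. The sketch below assumes that `nlp.vocab.lex_attr_getters` is a writable dict consulted whenever a new lexeme is created. It uses `spacy.blank("en")` instead of `en_core_web_lg` to stay self-contained, and `like_num` here is only an illustrative stand-in for the function elided in the original post.

```python
import spacy
from spacy.attrs import LIKE_NUM
from spacy.lang.en.lex_attrs import like_num as default_like_num

# Illustrative stand-in for the custom function elided in the original post.
LONG_SCALE = {"milliard", "billiard", "trilliard"}


def like_num(text: str) -> bool:
    return text.lower() in LONG_SCALE or default_like_num(text)


# Case 1: the getter is replaced on the vocab before "milliard" is ever seen.
nlp = spacy.blank("en")
nlp.vocab.lex_attr_getters[LIKE_NUM] = like_num  # assumption: writable dict
doc = nlp("I have one milliard dollars.")
unseen_word_flag = doc[3].like_num

# Case 2: "milliard" is processed first, so its lexeme is already cached.
cached_nlp = spacy.blank("en")
cached_nlp("milliard")
cached_nlp.vocab.lex_attr_getters[LIKE_NUM] = like_num
cached_word_flag = cached_nlp("milliard")[0].like_num
```

If the assumption holds, `unseen_word_flag` reflects the custom function while `cached_word_flag` still carries the value computed the first time the word was seen, which is the lexeme cache described in the answer.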

Instead of modifying nlp.Defaults, it's generally recommended to modify EnglishDefaults

✔️ This does work, however! For future passers-by:

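import spacy
from spacy.lang.en import EnglishDefaults
from spacy.language import Language
from {blah_blah_blah}.lex_attrs import LEX_ATTRS

# Must run before the pipeline is loaded, because the defaults are read at load time.
EnglishDefaults.lex_attr_getters = LEX_ATTRS
nlp: Language = spacy.load("en_core_web_lg")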

Note that this affects any English pipeline loaded within the same script.

That's fine for my use case. I'm simply constructing tests for components I've written.
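Since the change to `EnglishDefaults` affects every English pipeline loaded afterwards, a test suite may want to undo it once the pipeline under test has been created. The following is a sketch only: `patched_english_lex_attrs` is a hypothetical helper, `like_num` is an illustrative stand-in for the custom function, and `spacy.blank("en")` replaces `en_core_web_lg` so the example runs without a downloaded model.

```python
from contextlib import contextmanager

import spacy
from spacy.attrs import LIKE_NUM
from spacy.lang.en import EnglishDefaults
from spacy.lang.en.lex_attrs import like_num as default_like_num

# Illustrative stand-in for the custom function elided in the original post.
LONG_SCALE = {"milliard", "billiard", "trilliard"}


def like_num(text: str) -> bool:
    return text.lower() in LONG_SCALE or default_like_num(text)


@contextmanager
def patched_english_lex_attrs(getters):
    """Hypothetical helper: patch EnglishDefaults, then restore it on exit."""
    original = EnglishDefaults.lex_attr_getters
    EnglishDefaults.lex_attr_getters = {**original, **getters}
    try:
        yield
    finally:
        EnglishDefaults.lex_attr_getters = original


# The pipeline must be created while the patch is active.
with patched_english_lex_attrs({LIKE_NUM: like_num}):
    nlp = spacy.blank("en")

doc = nlp("I have one milliard dollars.")
flags = {token.text: token.like_num for token in doc}

# A pipeline created after the block sees the original defaults again.
other_nlp = spacy.blank("en")
other_flag = other_nlp("milliard")[0].like_num
```

The pipeline created inside the `with` block keeps the custom getter, because the defaults were read when it was built; pipelines created afterwards are unaffected.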

adrianeboyd Aug 4, 2022

Thanks for reporting back, this is the better/cleaner solution.
