Is it possible to update or replace the language data within an instantiated language pipeline? #11254
Context
After following the documentation, I have the following:

# lex_attrs.py
from spacy.attrs import LIKE_NUM

def like_num(text: str) -> bool:
    """
    Broadens definition of LIKE_NUM to include long scale (milliard, billiard, trilliard) and other numerics.
    """
    ...

LEX_ATTRS = {LIKE_NUM: like_num}

# custom_english.py
from typing import Final

import spacy
from spacy.lang.en import English

from {blah_blah_blah}.lex_attrs import LEX_ATTRS

_LANGUAGE_NAME: Final[str] = "en_custom"

class DefaultsCustomEnglish(English.Defaults):
    lex_attr_getters = LEX_ATTRS

@spacy.registry.languages(_LANGUAGE_NAME)
class CustomEnglish(English):
    lang = _LANGUAGE_NAME
    Defaults = DefaultsCustomEnglish

I have created a

Question
However, I was wondering if I could simply swap the language data of an existing pipeline, like so (shown with IPython):

In [1]: import spacy
In [2]: from {blah_blah_blah}.custom_english import DefaultsCustomEnglish
# below, we see that LIKE_NUM for "milliard" is False
In [3]: nlp = spacy.load("en_core_web_lg") # or _sm or _md, doesn't matter
In [4]: doc = nlp("I have one milliard dollars.")
In [5]: analyze_tokens(doc, "like_num")
Out[5]:
              0      1     2         3        4      5
text          I   have   one  milliard  dollars      .
like_num  False  False  True     False    False  False
# I switch the language data
In [6]: nlp.Defaults = DefaultsCustomEnglish
In [7]: nlp.Defaults.lex_attr_getters
Out[7]: {10: <function {blah_blah_blah}.lex_attrs.like_num(text: str) -> bool>}
# here we see that the function does indeed return True for "milliard"
In [8]: nlp.Defaults.lex_attr_getters[10]("milliard")
Out[8]: True
# however, the new language data does not take effect for the pipeline
In [9]: doc = nlp("I have one milliard dollars.")
In [10]: analyze_tokens(doc, "like_num")
Out[10]:
              0      1     2         3        4      5
text          I   have   one  milliard  dollars      .
like_num  False  False  True     False    False  False

TL;DR: why can't I swap language data on the fly? Is there a way to do so? Or must I rely on a saved-to-disk pipeline?
Replies: 1 comment 2 replies
Some of the defaults are only read in right when the pipeline is loaded, and there's a lexeme cache in nlp.vocab, so the lexical attributes are only calculated the first time a token is seen. There's no way to clear the lexeme cache other than reloading the pipeline.

In the example above I think you're seeing the lexeme cache, so it would work if you hadn't processed a text containing "milliard" before modifying the function.

Instead of modifying nlp.Defaults, it's generally recommended to modify EnglishDefaults before loading an en pipeline. (Note that this affects any English pipeline loaded within the same script.)
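
To make the caching behaviour described above concrete, here is a minimal pure-Python sketch. It is a toy model only, not spaCy's actual implementation: ToyVocab, default_like_num, custom_like_num and the string key LIKE_NUM are all hypothetical stand-ins (spaCy itself keys attributes by integer ID). It only illustrates why an attribute computed once per word keeps its old value until a fresh object is built.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the attribute key; spaCy uses an integer ID.
LIKE_NUM = "like_num"


def default_like_num(text: str) -> bool:
    """Toy default getter: only knows a few number words."""
    return text.lower() in {"one", "two", "three"}


def custom_like_num(text: str) -> bool:
    """Toy custom getter: also accepts long-scale number words."""
    return default_like_num(text) or text.lower() in {
        "milliard",
        "billiard",
        "trilliard",
    }


class ToyVocab:
    """Toy vocabulary that computes each word's attribute once, then caches it."""

    def __init__(self, getters: Dict[str, Callable[[str], bool]]) -> None:
        self.getters = getters
        self._cache: Dict[str, bool] = {}

    def like_num(self, text: str) -> bool:
        # The getter runs only the first time a word is seen.
        if text not in self._cache:
            self._cache[text] = self.getters[LIKE_NUM](text)
        return self._cache[text]


getters = {LIKE_NUM: default_like_num}
vocab = ToyVocab(getters)

seen_before = vocab.like_num("milliard")   # computed with the default getter, then cached

getters[LIKE_NUM] = custom_like_num        # change the getter after the word was seen

seen_after = vocab.like_num("milliard")    # served from the cache, so unchanged
unseen_after = vocab.like_num("billiard")  # first seen after the change, uses the new getter

fresh_vocab = ToyVocab(getters)            # analogous to reloading the pipeline
fresh_result = fresh_vocab.like_num("milliard")

print(seen_before, seen_after, unseen_after, fresh_result)
```

In this toy model the only way to get the new value for an already-seen word is to build a new ToyVocab, which mirrors the advice to change the defaults first and load the pipeline afterwards.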