making a trained model inherit custom tokenizer #8498
Unanswered
DSLituiev asked this question in Help: Other Questions
Replies: 1 comment · 6 replies
-
Can you provide a bit more background? How did you implement the custom tokenizer? Did you register a custom one, and do you see it in the config file of the model that is saved to disk? I'd be happy to look into this further, but it would be good to have some code to be able to replicate the issue: the custom tokenizer, an example config file, and an example snippet of the input text you're feeding in.
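For reference, this is a minimal sketch of what the relevant block in the saved model's config.cfg would look like if a custom tokenizer was registered via spaCy's @registry.tokenizers decorator; the name "no_emoticon_tokenizer.v1" is a placeholder, not necessarily what was used here:

```ini
[nlp.tokenizer]
# placeholder registry name; the stock value would be "spacy.Tokenizer.v1"
@tokenizers = "no_emoticon_tokenizer.v1"
```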
-
I am training a model that needs the "8)" emoticon exception disabled, so that "8)" is not kept as a single token.
I train with:
python -m spacy train config-filled.cfg --code ./functions.py --output my-best-model-ever
When I take that model and run prodigy ner.teach, I see that Prodigy again groups "8)" into one token. How can I make sure the custom code is inherited by the checkpoints? Do I have to copy it somewhere manually?
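For illustration, here is a minimal sketch of the kind of functions.py that could be passed via --code to drop the "8)" rule; the registry name and the exact rule tweak are assumptions, not necessarily what the original functions.py contains:

```python
# functions.py -- hypothetical sketch, not the original poster's code
from spacy.util import registry


@registry.tokenizers("no_emoticon_tokenizer.v1")
def create_no_emoticon_tokenizer():
    # Reuse spaCy's built-in tokenizer factory, then drop the "8)" special
    # case so the tokenizer no longer keeps it as a single emoticon token.
    default_factory = registry.tokenizers.get("spacy.Tokenizer.v1")()

    def make_tokenizer(nlp):
        tokenizer = default_factory(nlp)
        rules = dict(tokenizer.rules or {})
        rules.pop("8)", None)
        tokenizer.rules = rules
        return tokenizer

    return make_tokenizer
```

With a setup like this, the saved pipeline's config.cfg only stores the registry name, so whichever process loads the model later (including prodigy ner.teach) needs to be able to import the registering code again; spacy package --code functions.py is one way to bundle the file with the model so it is imported automatically on load.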