Best practice to keep the custom tokenizer and vocabulary when resuming training #7582
Hi, I think the best place to do this is `[initialize.before_init]`:

```python
import spacy
from spacy.util import registry

@registry.callbacks("copy_vocab_tokenizer")
def make_copy_vocab_tokenizer():
    def copy_vocab_tokenizer(nlp):
        # Load the base pipeline and copy over its tokenizer and vocab
        other_nlp = spacy.load("en_ner_bc5cdr_md")
        nlp.tokenizer.from_bytes(other_nlp.tokenizer.to_bytes())
        nlp.vocab.from_bytes(other_nlp.vocab.to_bytes())
    return copy_vocab_tokenizer
```

```ini
[initialize]

[initialize.before_init]
@callbacks = "copy_vocab_tokenizer"
```

This runs once when the model is initialized before training, so you have to have the callback code available to `spacy train`. Copying the vocab like this would mean the vectors are copied twice (also once for `[initialize.vectors]`).
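Since the callback is a custom registered function, it has to be importable when training starts; assuming it is saved in a hypothetical `functions.py`, it can be passed to the CLI via the `--code` flag:

```
python -m spacy train config.cfg --output ./output --code functions.py
```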
Dear community,

Thank you very much for the amazing work on spaCy, and especially on version 3.

While migrating from `prodigy train` to `spacy train` (spaCy 3), I stumbled on an issue with resumed training: the custom tokenizer and the custom vocabulary from the base pipeline are not kept in the updated pipeline.

How would you recommend using `config.cfg` to keep the custom tokenizer and vocabulary from the base pipeline? Thank you for your help!
### Context

- The base pipeline is from scispaCy, for example `en_ner_bc5cdr_md` (v0.4.0). It does NER.
- The resumed training is for the NER component. The goal is to update the pipeline with more examples (same domain).
- The pipeline uses a custom tokenizer and a custom vocabulary:
  - The tokenizer is added in their `config.cfg` here (see the sketch after this list).
  - The code for the tokenizer and the path to the vocabulary are provided to `spacy train` here.
  - The vocabulary is created here.
- I use `spacy train` and a dedicated `config.cfg` for resumed training (see next section).
- Finally, the version of spaCy is 3.0.5 and the version of scispaCy is 0.4.0.
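For reference, in a spaCy 3 `config.cfg` a custom tokenizer is hooked in through the `[nlp.tokenizer]` block; a minimal sketch, with a hypothetical registry name standing in for scispaCy's actual one:

```ini
[nlp.tokenizer]
# "scispacy.Tokenizer.v1" is a placeholder; scispaCy registers its own name
@tokenizers = "scispacy.Tokenizer.v1"
```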
### config.cfg
My `config.cfg` is from `spacy init config config.cfg -l en -p ner -o accuracy`, with the following changes (a sketch of what they might look like follows this list):

- `pipeline` modified;
- `[components]` replaced;
- `frozen_components` modified;
- `vectors` in `[initialize]` modified;
- `init_tok2vec` in `[initialize]` modified.
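For illustration only, the changed sections might look roughly like this when resuming from `en_ner_bc5cdr_md` (the values are assumptions, not the actual config):

```ini
[nlp]
# keep only the component being updated
pipeline = ["ner"]

[components.ner]
# reuse the trained NER component from the base pipeline
source = "en_ner_bc5cdr_md"

[training]
# nothing frozen here: the sourced ner component is being updated
frozen_components = []

[initialize]
# load the vectors from the base pipeline
vectors = "en_ner_bc5cdr_md"
init_tok2vec = null
```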
### Reproduce
The custom tokenizer from the base pipeline (e.g. `en_ner_bc5cdr_md`) is not kept in the trained pipeline.

The custom vocabulary from the base pipeline (e.g. `en_ner_bc5cdr_md`) is not kept in the trained pipeline.

Both can be checked as sketched below.
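A minimal way to verify this, assuming the trained pipeline was written to a hypothetical `./output/model-best`:

```python
import spacy

base = spacy.load("en_ner_bc5cdr_md")
trained = spacy.load("./output/model-best")  # hypothetical output path

# If the custom tokenizer and vocab had been carried over, the serialized
# forms would match; per the report above, both comparisons come out False.
print(base.tokenizer.to_bytes() == trained.tokenizer.to_bytes())
print(base.vocab.to_bytes() == trained.vocab.to_bytes())
```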