Best practice to keep the custom tokenizer and vocabulary when resuming training #7582
Hi, I think the best place to do this is `[initialize.before_init]`:

```python
import spacy
from spacy.util import registry

@registry.callbacks("copy_vocab_tokenizer")
def make_copy_vocab_tokenizer():
    def copy_vocab_tokenizer(nlp):
        # Load the base pipeline and copy over its tokenizer and vocab
        other_nlp = spacy.load("en_ner_bc5cdr_md")
        nlp.tokenizer.from_bytes(other_nlp.tokenizer.to_bytes())
        nlp.vocab.from_bytes(other_nlp.vocab.to_bytes())
    return copy_vocab_tokenizer
```

```ini
[initialize]

[initialize.before_init]
@callbacks = "copy_vocab_tokenizer"
```

This runs once when the model is initialized before training, so you have to have the callback code available to `spacy train`. Copying the vocab like this would mean the vectors are copied twice (also once for `[initialize.vectors]`).
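Since the callback is a custom registered function, it has to be importable when training starts; assuming it is saved in a hypothetical `functions.py`, it can be passed to the CLI via the `--code` flag:

```
python -m spacy train config.cfg --output ./output --code functions.py
```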
Dear community,

Thank you very much for the amazing work on spaCy, and especially on version 3.

While migrating from `prodigy train` to `spacy train` (spaCy 3), I stumbled on an issue with resumed training: the custom tokenizer and the custom vocabulary from the base pipeline are not kept in the updated pipeline.

How would you recommend using `config.cfg` to keep the custom tokenizer and vocabulary from the base pipeline? Thank you for your help!
### Context

- The base pipeline is from scispaCy, for example `en_ner_bc5cdr_md` (v0.4.0). It does NER.
- The resumed training is for the NER component. The goal is to update the pipeline with more examples (same domain).
- The pipeline uses a custom tokenizer and a custom vocabulary:
  - The tokenizer is added in their `config.cfg` here (see the sketch after this list).
  - The code for the tokenizer and the path to the vocabulary are provided to `spacy train` here.
  - The vocabulary is created here.
- I use `spacy train` and a dedicated `config.cfg` for resumed training (see next section).
- Finally, the version of spaCy is 3.0.5 and the version of scispaCy is 0.4.0.
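For reference, in a spaCy 3 `config.cfg` a custom tokenizer is hooked in through the `[nlp.tokenizer]` block; a minimal sketch, with a hypothetical registry name standing in for scispaCy's actual one:

```ini
[nlp.tokenizer]
# "scispacy.Tokenizer.v1" is a placeholder; scispaCy registers its own name
@tokenizers = "scispacy.Tokenizer.v1"
```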
### config.cfg
My `config.cfg` is from `spacy init config config.cfg -l en -p ner -o accuracy`, with the following changes (a sketch of what they might look like follows this list):

- `pipeline` modified;
- `[components]` replaced;
- `frozen_components` modified;
- `vectors` in `[initialize]` modified;
- `init_tok2vec` in `[initialize]` modified.
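For illustration only, the changed sections might look roughly like this when resuming from `en_ner_bc5cdr_md` (the values are assumptions, not the actual config):

```ini
[nlp]
# keep only the component being updated
pipeline = ["ner"]

[components.ner]
# reuse the trained NER component from the base pipeline
source = "en_ner_bc5cdr_md"

[training]
# nothing frozen here: the sourced ner component is being updated
frozen_components = []

[initialize]
# load the vectors from the base pipeline
vectors = "en_ner_bc5cdr_md"
init_tok2vec = null
```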
### Reproduce
The custom tokenizer from the base pipeline (e.g. `en_ner_bc5cdr_md`) is not kept in the trained pipeline.

The custom vocabulary from the base pipeline (e.g. `en_ner_bc5cdr_md`) is not kept in the trained pipeline.

Both can be checked as sketched below.
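A minimal way to verify this, assuming the trained pipeline was written to a hypothetical `./output/model-best`:

```python
import spacy

base = spacy.load("en_ner_bc5cdr_md")
trained = spacy.load("./output/model-best")  # hypothetical output path

# If the custom tokenizer and vocab had been carried over, the serialized
# forms would match; per the report above, both comparisons come out False.
print(base.tokenizer.to_bytes() == trained.tokenizer.to_bytes())
print(base.vocab.to_bytes() == trained.vocab.to_bytes())
```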