Programmatic retraining of existing model #9682

JFulweber · 2021-11-16T21:07:51Z

I am trying to build a service that stores training data, compiles it to a spaCy binary, loads the associated model, and then trains on that data. To train, I am calling spacy.cli.train.train with the appropriate parameters in version 3.1.4. I am testing this with a model that was trained from en_core_web_sm, and in the model's config, components.ner looks like this:

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
update_with_oracle_cut_size = 100

In this case I want to replace factory = "ner" with source = "<some model path>" and component = "ner", but I cannot do this using configuration overrides, as factory = "ner" will still be present. I have seen advice discouraging modification of configs after training because it can invalidate some things, so is this an okay approach, or is there a better way of programmatically training like this?

Answered by polm

polm · 2021-11-17T04:10:32Z

I'm a little unclear on which parts you're actually retraining here. I guess the sourced NER component is something you've already trained and want to train further?

You're right that you can't do this with config overrides; you'll need to rewrite the config. You can load the config into a dict, modify it, and write it back out if you need to change a part like that.
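As a rough sketch of the load, modify, write approach: the snippet below works on a plain dict that stands in for the loaded config, so it runs without spaCy installed. The helper name `source_component` is made up for illustration, and the path is the placeholder from the question. With spaCy itself, `spacy.util.load_config(path)` should give you a dict-like `Config` object and its `to_disk(path)` method should write it back out, but treat those two steps as assumptions to check against your installed version.

```python
import copy


def source_component(config, name, model_path):
    """Return a copy of `config` in which the block for component `name`
    is replaced by one that sources the component from `model_path`.

    Hypothetical helper for illustration; not part of spaCy.
    """
    new_config = copy.deepcopy(config)
    # Replace the whole block so no stale `factory` key is left behind.
    new_config["components"][name] = {"source": model_path, "component": name}
    return new_config


# Stand-in for the dict form of the [components.ner] block from the question.
# With spaCy this would come from something like:
#     config = spacy.util.load_config("config.cfg")   # assumption: check your version
config = {
    "components": {
        "ner": {
            "factory": "ner",
            "incorrect_spans_key": None,
            "moves": None,
            "update_with_oracle_cut_size": 100,
        }
    }
}

# "<some model path>" is the placeholder from the question, not a real path.
updated = source_component(config, "ner", "<some model path>")
print(updated["components"]["ner"])

# With spaCy, the modified config could then be written out with something like:
#     updated.to_disk("config_sourced.cfg")            # assumption: check your version
```

Replacing the whole block rather than adding keys to it is the point here: that is what an override cannot do, since the original `factory` setting would otherwise remain alongside `source`.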

What will work here depends on how the tok2vec is configured. If your NER component uses a listener, this just won't work, because the NER component will have been trained with a different tok2vec than the one in your pipeline (unless you're also sourcing the tok2vec, I guess). If it uses an embedded tok2vec, then I think that might work, but keep in mind that you need to have the same vectors etc.
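One way to tell which case you're in is to look at the component's `model.tok2vec` block in the config. This is a minimal sketch that again uses plain dicts as stand-ins for the loaded config; `uses_listener` is a hypothetical helper, and it assumes a listener shows up as an `@architectures` value beginning with `spacy.Tok2VecListener`, which is how spaCy v3 configs normally express it. The two sample configs are trimmed-down illustrations, not real model configs.

```python
def uses_listener(config, name):
    """Return True if component `name` gets its tok2vec from a shared
    listener rather than an embedded tok2vec layer.

    Hypothetical helper; assumes the spaCy v3 config layout
    components.<name>.model.tok2vec.@architectures.
    """
    component = config.get("components", {}).get(name, {})
    tok2vec = component.get("model", {}).get("tok2vec", {})
    architecture = tok2vec.get("@architectures", "")
    return architecture.startswith("spacy.Tok2VecListener")


# Illustrative, heavily trimmed configs (not copied from a real pipeline).
listener_config = {
    "components": {
        "ner": {
            "factory": "ner",
            "model": {"tok2vec": {"@architectures": "spacy.Tok2VecListener.v1"}},
        }
    }
}
embedded_config = {
    "components": {
        "ner": {
            "factory": "ner",
            "model": {"tok2vec": {"@architectures": "spacy.Tok2Vec.v2"}},
        }
    }
}

print(uses_listener(listener_config, "ner"))
print(uses_listener(embedded_config, "ner"))
```

If the check comes back true, sourcing only the NER component is the case described above as not working, unless the matching tok2vec is sourced as well.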

Are you just training an NER component? If so, you shouldn't need to source the previously trained one. You can technically retrain an NER component, but we recommend against it due to catastrophic forgetting (see the FAQ).
