Current recommended way to resume training V3 model? #9157

djmechanic · 2021-09-06T15:15:37Z


I'm relatively new to spaCy but have written a multi-label textcat using the (almost) excellent V3 documentation, and I'm able to init, train, save and load my model from the CLI. I'm also able to load and use the model for predictions on new data. However, I cannot figure out the best way to resume training my model in V3. Right now it starts from scratch every time I run spacy train.

I've looked at #8176 #8598 #9078 as well as https://spacy.io/api/cli#pretraining and they mostly point to out-of-date V2 sample projects or don't actually answer the question. A sample project.yml file that generates a config.cfg for spacy train doesn't at all answer the question about resuming training. The most useful answer points to https://spacy.io/usage/processing-pipelines#sourced-components, for which I need to duplicate my config file and replace all pipeline component factories with sourced components. It seems stupid to me to maintain 2 almost identical config files (train.cfg and retrain.cfg) duplicating everything but factories/sources. I can't imagine people in the real world having to change settings in both files. If this is indeed the "correct" way then I would appreciate a slightly more elaborate example, because it is unclear from the responses above what to do with all the other pipeline component settings in the config file.

Perhaps there is a more "correct" modular way to reuse config files to minimize duplication. Or perhaps someone knows of a clever way to switch between factories/sources within one config file based on a variable or command line option for train (similar to what pretraining provides)? Any suggestions would be welcome.

Unless I'm missing something, I may be so bold as to suggest new users would benefit from 2 small additions to the Getting Started section: 1) how to load my trained model to use for predictions, and 2) how to resume training for my model.

polm · 2021-09-07T05:01:31Z


Hey, sorry this is confusing.

When people say they want to "resume training", there are typically two different things they mean. Your links cover both of these situations.

1. A training process was killed and they want to restart it from where it was killed, with the same data ("Is It Possible to Resume Training Via CLI in Spacy v3 (transformers)?" #8176, pretraining)
2. They want to train an existing model on more (different) data (the other things you linked to)

These are pretty different.

In case 1, you can fix things simply by sourcing components, as described in #8176. I think you can override this with just command line params rather than maintaining a separate config. This is only something that I'd expect you need to do in model development, and even then not all that often, only when training is interrupted. If you're frequently resuming training could you explain your use case a little more? (I definitely think we could make this a lot easier, but it just hasn't been a priority.)
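To avoid hand-maintaining two nearly identical files, the second config could be generated from the first. The following is only a rough sketch using Python's standard configparser, not spaCy's own config tooling. It assumes that a sourced component takes its settings from the saved pipeline, so the component's sub-sections (such as its model block) are dropped; the function name and the "training/model-last" path are hypothetical.

```python
import configparser
import io


def make_resume_config(train_cfg_text, trained_pipeline):
    """Return a copy of a training config in which every factory-built
    component is replaced by a `source = "<trained_pipeline>"` entry."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(train_cfg_text)

    # Top-level component sections ([components.NAME]) that use a factory.
    component_sections = [
        section
        for section in parser.sections()
        if section.startswith("components.")
        and section.count(".") == 1
        and parser.has_option(section, "factory")
    ]

    for section in component_sections:
        # Assumption: nested settings such as [components.NAME.model] come
        # from the saved pipeline, so they are removed from the new config.
        for other in parser.sections():
            if other.startswith(section + "."):
                parser.remove_section(other)
        # Clear the component's own options and point it at the saved pipeline.
        for option in list(parser[section]):
            parser.remove_option(section, option)
        parser.set(section, "source", f'"{trained_pipeline}"')

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    example = (
        '[nlp]\npipeline = ["textcat_multilabel"]\n\n'
        '[components.textcat_multilabel]\nfactory = "textcat_multilabel"\n\n'
        '[components.textcat_multilabel.model]\n'
        '@architectures = "spacy.TextCatBOW.v2"\n'
    )
    # "training/model-last" is a hypothetical output directory.
    print(make_resume_config(example, "training/model-last"))
```

Everything outside the component blocks (paths, corpora, training settings) is copied through unchanged, so edits only need to happen in the original file. This does not restore optimizer state or learning rate schedules.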

In case 2, this is a fundamentally difficult problem in machine learning. There are a lot of problems but the basic one is that it's hard to balance the new data against the no-longer-available old data, and that results in catastrophic forgetting. We have an experimental "rehearsal" API to improve this, but the best thing is to train a model from scratch with all of the data.

> how to load my trained model to use for predictions

This is just spacy.load("path/to/my-model"). I'll look at adding a note about that to the end of the quickstart.
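For a multi-label textcat, using the loaded pipeline could look roughly like the sketch below. It relies on doc.cats, which maps each label to a score; the helper name and the 0.5 threshold are assumptions for illustration, not values taken from this thread.

```python
def predict_labels(nlp, text, threshold=0.5):
    """Run a loaded pipeline on `text` and return the labels whose score
    in doc.cats is at least `threshold`, sorted by name."""
    doc = nlp(text)
    return sorted(label for label, score in doc.cats.items() if score >= threshold)


# Usage (requires spaCy and a trained pipeline saved on disk):
#
#   import spacy
#   nlp = spacy.load("path/to/my-model")
#   print(predict_labels(nlp, "Some new text to classify"))
```

Loading the pipeline once and reusing the nlp object for many texts avoids paying the load cost on every call.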

3 replies

djmechanic Sep 7, 2021
Author

Thank you for the quick and helpful response.

I'm not too bothered about case 1 as it hasn't happened to me yet and given my current <1 day training cycles I'd just restart. But in future if I've lost days of training time I'd need a better solution. The docs/examples could elaborate a bit more to clarify exactly how to make this change: a) if in the config file I replace the factory line with a source line, can I leave everything else as is ... what about all the pipeline component sections below? or b) if on the command line, how do I override a factory option with a source option, seeing as they're named differently?

You're correct that my real interest is case 2 to train an existing model on different or new data. I've trained it on dataset A and it works fine but I've prepared an additional dataset B and may want to use dataset C in future (to enrich or augment the data the model has seen). Of course I could combine A+B+C into a single dataset and periodically retrain the model on everything but it seems to me the data+training time would snowball.

Similarly, I want to teach my model about new cases it may have gotten wrong: we run some data through it, the user says hang on you got row 42 wrong, and we "freshen up" the model with row 42. I can't imagine retraining the entire model for days on datasets A+B+C+row 42. Is this really what people are doing in practice?

Cheers for the response.

polm Sep 9, 2021

> Similarly, I want to teach my model about new cases it may have gotten wrong: we run some data through it, the user says hang on you got row 42 wrong, and we "freshen up" the model with row 42. I can't imagine retraining the entire model for days on datasets A+B+C+row 42. Is this really what people are doing in practice?

In my experience generally you wait a week or so and you have (tens of) thousands of new data points, and then you retrain with all data.
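As a rough illustration of "retrain with all data", accumulated datasets and user corrections could be merged before each full retrain, with later entries overriding earlier ones for the same text. The (text, cats) record format and the function name here are assumptions; converting the merged records into spaCy's training format is left out.

```python
def merge_datasets(*datasets):
    """Merge (text, cats) records from several datasets.

    Datasets are applied in order, so when the same text appears more than
    once, the label dict from the latest dataset wins (e.g. a correction).
    """
    merged = {}
    for dataset in datasets:
        for text, cats in dataset:
            merged[text] = cats
    return list(merged.items())


if __name__ == "__main__":
    dataset_a = [("great product", {"POSITIVE": 1.0, "NEGATIVE": 0.0})]
    dataset_b = [("never again", {"POSITIVE": 0.0, "NEGATIVE": 1.0})]
    # A user correction for a text the model got wrong (hypothetical data).
    corrections = [("great product", {"POSITIVE": 0.0, "NEGATIVE": 1.0})]
    for record in merge_datasets(dataset_a, dataset_b, corrections):
        print(record)
```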

You really cannot update a neural net with just one new example. In the pre-neural days there was "online learning", where updating the model one example at a time was normal, but that was possible due to a completely different architecture. I think some people have experimented with online updates for neural architectures but it's not widespread.

For the really big models that take a while to train, what you can do, especially for language, is have a big generic model like BERT and fine-tune it more frequently with a smaller data set. That's basically the way training with a pretrained tok2vec/transformer with spaCy works. If you have a really big data set maybe you can train a base and update it occasionally, but you still need to fine-tune with a reasonable data set and not a tiny diff.

Another option is that if you use CPU models they train quite quickly. It can be beneficial to use small CPU models while you figure out the overall flow of your application and get initial feedback, so you can change things and retrain quickly, and swap them for larger models after things have settled down a bit.

Alternately, if something like row 42 in your example is really important, you can use rules to patch specific cases, which give you a lot more control, but can't extrapolate.
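One simple form of such a rule is an exact-match override table consulted before the model. This is only a sketch: the function name, the lowercase/strip normalization, and the shape of the override table are all assumptions.

```python
def predict_with_overrides(text, model_predict, overrides):
    """Return a hand-written answer for known problem inputs, otherwise
    fall back to the model.

    `overrides` maps normalized text to the labels it should receive;
    `model_predict` is any callable that takes text and returns labels.
    """
    key = text.strip().lower()
    if key in overrides:
        return overrides[key]
    return model_predict(text)


if __name__ == "__main__":
    # Hypothetical patch for one input the model got wrong.
    overrides = {"row 42 text": ["CORRECT_LABEL"]}
    model = lambda text: ["MODEL_LABEL"]  # stand-in for a real pipeline
    print(predict_with_overrides("  Row 42 text ", model, overrides))
    print(predict_with_overrides("anything else", model, overrides))
```

The patch only fires on inputs that match after normalization, which is the "can't extrapolate" trade-off: similar but not identical texts still go to the model.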

About resuming with command line parameters, let me look at it, but note that even with overrides it won't take care of learning rate schedules, for example, so it's still less than ideal.

polm Sep 16, 2021

Sorry for taking a while to get back to you on this - it turns out it's not easy to resume training from the command line with overrides. The simplest thing is to make a different config.

I understand that's inconvenient but it is something we want to improve going forward.
