Current recommended way to resume training V3 model? #9157
-
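Hey, sorry this is confusing. When people say they want to "resume training", they typically mean one of two different things: 1) continuing a training run that was interrupted, or 2) updating an already-trained model with new data. Your links cover both of these situations.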
These are pretty different.
In case 1, you can fix things simply by sourcing components from the saved pipeline, as described in #8176. I think you can do this with just command-line overrides rather than maintaining a separate config. I'd only expect you to need this during model development, and even then not often, only when training is interrupted. If you're resuming training frequently, could you explain your use case a little more? (I definitely think we could make this a lot easier, but it just hasn't been a priority.)
In case 2, you're up against a fundamentally difficult problem in machine learning. There are a lot of issues, but the basic one is that it's hard to balance the new data against the old data that is no longer available, and the result is catastrophic forgetting: the model loses what it learned from the old data. We have an experimental "rehearsal" API to mitigate this, but the best approach is to train a model from scratch on all of the data.
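For case 1, one way to avoid hand-maintaining a second config is to generate it from the first. The sketch below is only an illustration, not an official spaCy tool. It uses Python's standard configparser (not spaCy's own config loader) to rewrite every [components.<name>] block that has a factory into a block holding only a source entry that points at a saved pipeline. The helper name to_resume_config, the sample config, and the path output/model-last are all hypothetical. Dropping each converted component's nested settings rests on the assumption that a sourced component takes its settings from the saved pipeline, so check the output against the sourced-components documentation before relying on it. Comments in the original file are not preserved.

```python
import configparser
import io


def to_resume_config(config_text: str, model_path: str) -> str:
    """Return a copy of a training config whose factory components are sourced.

    Hypothetical helper for illustration; it is not part of spaCy.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(config_text)

    # Pass 1: replace the contents of each [components.<name>] block that
    # defines a factory with a single `source` entry.
    converted = set()
    for section in parser.sections():
        parts = section.split(".")
        is_component = len(parts) == 2 and parts[0] == "components"
        if is_component and parser.has_option(section, "factory"):
            for option in parser.options(section):
                parser.remove_option(section, option)
            # Config values are written JSON-style, so the path is quoted.
            parser.set(section, "source", f'"{model_path}"')
            converted.add(parts[1])

    # Pass 2: drop nested blocks such as [components.<name>.model] for the
    # converted components. ASSUMPTION: a sourced component takes these
    # settings from the saved pipeline, so they are not repeated here.
    for section in parser.sections():
        parts = section.split(".")
        if len(parts) > 2 and parts[0] == "components" and parts[1] in converted:
            parser.remove_section(section)

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Made-up, heavily shortened training config used only to exercise the helper.
TRAIN_CFG = """
[paths]
train = "corpus/train.spacy"

[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatBOW.v2"

[training]
max_steps = 2000
"""

resume_cfg = to_resume_config(TRAIN_CFG, "output/model-last")
print(resume_cfg)
```

Generating the second file this way keeps train.cfg as the single place where settings are edited. The result could be written to something like retrain.cfg and passed to spacy train in place of the original.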
This is just |
-
I'm relatively new to spaCy but have written a multi-label textcat using the (almost) excellent v3 documentation, and I'm able to init, train, save and load my model from the CLI. I'm also able to load the model and use it for predictions on new data. However, I cannot figure out the best way to resume training my model in v3. Right now it starts from scratch every time I run spacy train.
I've looked at #8176, #8598 and #9078 as well as https://spacy.io/api/cli#pretraining, and they mostly point to out-of-date v2 sample projects or don't actually answer the question. A sample project.yml file that generates a config.cfg for spacy train doesn't answer the question of resuming training at all. The most useful answer points to https://spacy.io/usage/processing-pipelines#sourced-components, for which I need to duplicate my config file and replace all pipeline component factories with sourced components. It seems stupid to me to maintain two almost identical config files (train.cfg and retrain.cfg) that duplicate everything except the factories/sources. I can't imagine people in the real world having to change settings in both files. If this is indeed the "correct" way, then I would appreciate a slightly more elaborate example, because it is unclear from the responses above what to do with all the other pipeline component settings in the config file.
Perhaps there is a more "correct", modular way to reuse config files to minimize duplication. Or perhaps someone knows of a clever way to switch between factories and sources within one config file based on a variable or a command-line option for train (similar to what pretraining provides)? Any suggestions would be welcome.
Unless I'm missing something, may I be so bold as to suggest that new users would benefit from two small additions to the Getting Started section: 1) how to load my trained model to use for predictions, and 2) how to resume training for my model.