Multiple pipelines with different, yet similar, datasets #11767
Your example config is a reasonable thing to do. What that config would do is load the NER component from the pipeline saved at "model-best", load the tok2vec from that pipeline and embed it with that NER component, and include that in the pipeline you're training without trying to train it. The result will work the same, but what I would do in this case instead is just train the textcat and NER separately, and have a config that sources both the NER and textcat components. You can then use `spacy assemble` to combine them into a single pipeline without further training. You can see an example config for this purpose in the coref project, though note that in that case a transformer is included rather than a tok2vec.
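A minimal sketch of what such a sourcing config could look like (the paths, `lang`, and `replace_listeners` settings are assumptions about your setup):

```ini
# assemble.cfg: combine two separately trained pipelines into one
[nlp]
lang = "en"
pipeline = ["ner","textcat"]

[components]

[components.ner]
source = "training/ner/model-best"
# If the sourced NER was trained with a tok2vec listener, fold that
# tok2vec into the component so it is self-contained:
replace_listeners = ["model.tok2vec"]

[components.textcat]
source = "training/textcat/model-best"
replace_listeners = ["model.tok2vec"]
```

Running `python -m spacy assemble assemble.cfg combined-model` would then produce a single pipeline containing both trained components.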
To explain a separate point, one decision that needs to be made about the tok2vec is whether you want a shared one or not. You have a few options here:

1. Train one shared tok2vec together with all components on the same data.
2. Train one component with its tok2vec first, then freeze that tok2vec and train the other component on top of it.
3. Give each component its own independent tok2vec.

Since your example config uses an independent tok2vec per component, you're effectively following strategy 3. Of the other strategies, 2 is easier to set up, and has the advantage that you only have one tok2vec, so it uses less memory and should be faster. The downside is that the component that trains with the frozen tok2vec usually doesn't work as well as if it could train the tok2vec, though the difference can be small. Strategy 1 is a bit harder to set up in terms of training data, but it has the advantages of 2, usually without the decreased performance. To set up the training data you need to apply different annotations to the same data or explicitly mark annotations as "missing" (see the sketch below). In theory, the shared representations in strategy 1 can give an accuracy boost, but in practice the boost doesn't materialize or is small. Given that, I usually recommend starting with strategy 3 (like you have here) and then moving to 2 or 1 if it's necessary for performance reasons. (See the docs on sharing embedding layers for more detail on the tradeoffs.)
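To illustrate marking annotations as "missing" for strategy 1, here is a minimal sketch (the text, labels, and output path are made up):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# A section that only has a textcat label and no NER annotation:
doc = nlp("Quarterly revenue rose sharply in the third quarter.")
doc.cats = {"FINANCE": 1.0, "LEGAL": 0.0}

# Mark the entity annotation as missing (rather than "no entities"),
# so the NER component is not trained to predict "no entity" here:
doc.set_ents([], default="missing")

db = DocBin()
db.add(doc)
db.to_disk("./train.spacy")
```

Docs that do have entity annotations would instead pass their spans to `set_ents`, and docs without a category can leave `doc.cats` unset.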
To be clear, are you using one config for both your NER and textcat training? You can do that, but it would be easier to keep track of things if you have separate configs for each use case, so that you don't have to modify them just because you're in a different phase of training.
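For instance (the config and corpus file names here are just placeholders):

```bash
python -m spacy train ner.cfg --output training/ner \
    --paths.train corpus/ner_train.spacy --paths.dev corpus/ner_dev.spacy
python -m spacy train textcat.cfg --output training/textcat \
    --paths.train corpus/textcat_train.spacy --paths.dev corpus/textcat_dev.spacy
```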
---
I am training models that use ner, textcat and tok2vec. The NER and textcat data are different because the textcat is categorizing the sections of a larger document, each of which may or may not contain NER entities. I have started experimenting with tok2vec as I migrate from spaCy 2.3.
I understand the need to freeze one when training the other, or to remove one if training a new model without either. My questions involve the use of the tok2vec under a couple of different scenarios.
When training a new, clean model, I start with ner and tok2vec and ignore/freeze textcat. If I were to add a textcat pipeline after training ner, then based on the documentation my config should include something like the sketch below. Is this correct?
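Roughly this (a sketch from my reading of the docs; exact paths and settings may differ):

```ini
[components.ner]
source = "model-best"
# embed the trained tok2vec into the sourced NER component
replace_listeners = ["model.tok2vec"]

[training]
frozen_components = ["ner"]
```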
Now let's suppose I wish to resume training the ner after the textcat is trained. Does this setup remain in place? Does it revert back to just sourcing model-best without the freeze? Do I add a similar freeze for the newly trained textcat? Since the textcat data is the same text, just in smaller chunks, does any of this matter in this use case?
Any guidance you can provide is appreciated. Thanks!