Multiple pipelines with different, yet similar, datasets #11767
Your example config is a reasonable thing to do. What that config would do is load the NER component from the pipeline saved at "model-best", load the tok2vec from that pipeline and embed it with that NER component, and include that in the pipeline you're training without trying to train it. The result will work the same, but what I would do in this case instead is just train the textcat and NER separately, and have a config that sources both the NER and textcat components. You can then use `spacy assemble` to combine them into a single pipeline without further training. You can see an example config for this purpose in the coref project, though note that in that case a transformer is included rather than a tok2vec.
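A minimal sketch of what such a sourcing config could look like (the paths, `lang`, and `replace_listeners` settings are assumptions about your setup):

```ini
# assemble.cfg: combine two separately trained pipelines into one
[nlp]
lang = "en"
pipeline = ["ner","textcat"]

[components]

[components.ner]
source = "training/ner/model-best"
# If the sourced NER was trained with a tok2vec listener, fold that
# tok2vec into the component so it is self-contained:
replace_listeners = ["model.tok2vec"]

[components.textcat]
source = "training/textcat/model-best"
replace_listeners = ["model.tok2vec"]
```

Running `python -m spacy assemble assemble.cfg combined-model` would then produce a single pipeline containing both trained components.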
To explain a separate point, one decision that needs to be made about the tok2vec is whether you want a shared one or not. You have a few options here:

1. Train one shared tok2vec together with all components on the same data.
2. Train one component with its tok2vec first, then freeze that tok2vec and train the other component on top of it.
3. Give each component its own independent tok2vec.

Since your example config uses an independent tok2vec per component, you're effectively following strategy 3. Of the other strategies, 2 is easier to set up, and has the advantage that you only have one tok2vec, so it uses less memory and should be faster. The downside is that the component that trains with the frozen tok2vec usually doesn't work as well as if it could train the tok2vec, though the difference can be small. Strategy 1 is a bit harder to set up in terms of training data, but it has the advantages of 2, usually without the decreased performance. To set up the training data you need to apply different annotations to the same data or explicitly mark annotations as "missing" (see the sketch below). In theory, the shared representations in strategy 1 can give an accuracy boost, but in practice the boost doesn't materialize or is small. Given that, I usually recommend starting with strategy 3 (like you have here) and then moving to 2 or 1 if it's necessary for performance reasons. (See the docs on sharing embedding layers for more detail on the tradeoffs.)
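To illustrate marking annotations as "missing" for strategy 1, here is a minimal sketch (the text, labels, and output path are made up):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# A section that only has a textcat label and no NER annotation:
doc = nlp("Quarterly revenue rose sharply in the third quarter.")
doc.cats = {"FINANCE": 1.0, "LEGAL": 0.0}

# Mark the entity annotation as missing (rather than "no entities"),
# so the NER component is not trained to predict "no entity" here:
doc.set_ents([], default="missing")

db = DocBin()
db.add(doc)
db.to_disk("./train.spacy")
```

Docs that do have entity annotations would instead pass their spans to `set_ents`, and docs without a category can leave `doc.cats` unset.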
To be clear, are you using one config for both your NER and textcat training? You can do that, but it would be easier to keep track of things if you have separate configs for each use case, so that you don't have to modify them just because you're in a different phase of training.
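For instance (the config and corpus file names here are just placeholders):

```bash
python -m spacy train ner.cfg --output training/ner \
    --paths.train corpus/ner_train.spacy --paths.dev corpus/ner_dev.spacy
python -m spacy train textcat.cfg --output training/textcat \
    --paths.train corpus/textcat_train.spacy --paths.dev corpus/textcat_dev.spacy
```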
---
I am training models that use ner, textcat and tok2vec. The NER and textcat data are different because the textcat is categorizing the sections of a larger document, each of which may or may not contain NER entities. I have started experimenting with tok2vec as I migrate from spaCy 2.3.
I understand the need to freeze one when training the other, or to remove one if training a new model without either. My questions involve the use of the tok2vec under a couple of different scenarios.
When training a new, clean model, I start with ner and tok2vec and ignore/freeze textcat. If I were to add a textcat pipeline after training ner, then based on the documentation my config should include something like the sketch below. Is this correct?
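Roughly this (a sketch from my reading of the docs; exact paths and settings may differ):

```ini
[components.ner]
source = "model-best"
# embed the trained tok2vec into the sourced NER component
replace_listeners = ["model.tok2vec"]

[training]
frozen_components = ["ner"]
```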
Now let's suppose I wish to resume training the ner after the textcat is trained. Does this setup remain in place? Does it revert back to just sourcing model-best without the freeze? Do I add a similar freeze for the newly trained textcat? Since the textcat data is the same text, just in smaller chunks, does any of this matter in this use case?
Any guidance you can provide is appreciated. Thanks!