Multiple textcat_multilabel components in same spacy v3 pipeline #7498

drisspg · 2021-03-19T16:41:09Z


Hey, I am trying to migrate an existing spaCy 2.0 NLP pipeline to 3.0. Our current pipeline consists of a trained NER component and two multi-label TextCategorizer components. We originally had three custom training scripts. The scripts for training the NER component and the TextCat-1 component used the same dataset. The TextCat-2 component was trained on a different dataset. Each training script disabled all but the current pipeline component for training.

I was able to convert both existing datasets to the new DocBin format. I have successfully trained the NER and TextCat-1 components using python -m spacy train --paths.dev <> --paths.train <>
My current pipeline config is pipeline = ["tok2vec","ner","textcat_multilabel"]. I am not totally sure what my pipeline structure should be once TextCat-2 is added.

Should I create custom components that use the "textcat_multilabel" factory and then train both separately, using frozen_components and specifying the data paths through the CLI options? If so, is there a best practice for what my training config should look like?

adrianeboyd · 2021-03-22T10:17:18Z


It is possible to do this with two configs where the second one sources all the components from the first config and freezes them while training with the second dataset, but it can be easier to train two separate models without frozen components and then have a collate script that combines the two models. This is what we do for all the pretrained models like en_core_web_sm. It would look something like this:

import spacy

nlp1 = spacy.load("model1") # ["tok2vec","ner","textcat_multilabel"]
nlp2 = spacy.load("model2") # ["textcat_multilabel"]

nlp1.add_pipe("textcat_multilabel", name="textcat_multilabel2", source=nlp2)

nlp1.to_disk("combined_model")

If your second textcat_multilabel listens to a tok2vec component (i.e., you have ["tok2vec", "textcat_multilabel"] with a spacy.Tok2VecListener.v1), you can copy the tok2vec component into the textcat_multilabel component before sourcing with replace_listeners:

# replace_listeners is a method on the pipeline (Language) object, not on the component
nlp2.replace_listeners("tok2vec", "textcat_multilabel", ["model.tok2vec"])
nlp1.add_pipe("textcat_multilabel", name="textcat_multilabel2", source=nlp2)

(Just like you can have textcat_multilabel and textcat_multilabel2, you could also have tok2vec and tok2vec2, but it's more complicated to source components with separate listeners. You would want to keep a separate tok2vec2 if you have multiple components in nlp2 that listen to the same tok2vec, since you wouldn't want to duplicate the weights/processing in the combined pipeline. But if you just have one component that listens to the tok2vec, there's no benefit to keeping it separate.)
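For the two-config route mentioned at the start of this reply, a rough sketch of what the second config could contain is shown below. It is only a fragment and an illustration, not a complete training config: the path "model1" is taken from the snippet above, while lang = "en", the component name textcat_multilabel2 and the choice to create it from the plain factory are assumptions. The Python wrapper only parses the fragment with the standard library to show which component would still be trainable; it does not call spaCy.

```python
import configparser
import json

# Hypothetical fragment of the second training config. The first three
# components are sourced from the already trained "model1" and frozen;
# only the new text categorizer is left to train on the second dataset.
SECOND_CONFIG = """
[nlp]
lang = "en"
pipeline = ["tok2vec","ner","textcat_multilabel","textcat_multilabel2"]

[components.tok2vec]
source = "model1"

[components.ner]
source = "model1"

[components.textcat_multilabel]
source = "model1"

[components.textcat_multilabel2]
factory = "textcat_multilabel"

[training]
frozen_components = ["tok2vec","ner","textcat_multilabel"]
"""

parser = configparser.ConfigParser()
parser.read_string(SECOND_CONFIG)

# The list values in the fragment are valid JSON, so json.loads can read them.
pipeline = json.loads(parser["nlp"]["pipeline"])
frozen = json.loads(parser["training"]["frozen_components"])

# Components copied from model1 rather than created from a factory.
sourced = [
    section.split(".", 1)[1]
    for section in parser.sections()
    if section.startswith("components.") and "source" in parser[section]
]

# Whatever is in the pipeline but not frozen is what this run would update.
trainable = [name for name in pipeline if name not in frozen]
print("sourced:", sourced)
print("trainable:", trainable)
```

If textcat_multilabel2 were instead configured to listen to the frozen tok2vec, freezing one but not the other is the kind of setup where a listener copy via replace_listeners becomes relevant.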

2 replies

drisspg Apr 7, 2021
Author

Is it even possible to have a "textcat_multilabel" without a "tok2vec" component? I tried to train a pipeline with "tok2vec" listed in frozen_components and I got this warning:

[WARNING] [W086] Component 'textcat_multilabel' will be (re)trained, but it needs the component 'tok2vec' which is frozen.
You can either freeze both, or neither of the two. If you are sourcing the component from an existing pipeline, you
can use the `replace_listeners` setting in the config block to replace its token-to-vector listener with a copy and make it independent. For example, `replace_listeners = ["model.tok2vec"]`

So if I were to use the first suggestion, would it copy in the new "textcat_multilabel_2" architecture, populate its parameter weights, and rewire its Tok2VecListener to nlp1's tok2vec component? I think this is preferable in my situation to the second suggestion, which would make a copy of the embedding layer just for "textcat_multilabel_2".

If my pipeline was:
pipeline = ["tok2vec","ner","textcat_multilabel","textcat_multilabel_2"]

and on the first training run I passed in
python -m spacy train --training.frozen_components textcat_multilabel_2

and on the second training run, with different data, I passed in
python -m spacy train --training.frozen_components ner, textcat_multilabel

would the tok2vec layer then get updated in both runs?

drisspg Apr 7, 2021
Author

It appears, though, that list arguments can't be passed in through the CLI; at least, I haven't been able to figure out the syntax.
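On the list syntax: to my understanding, spaCy tries to read each config override value as JSON, so a list can be passed as a single quoted JSON string. That is an assumption worth checking against the installed version. The sketch below only builds and prints such a command with the standard library. The config path and the .spacy file names are hypothetical placeholders, and frozen_components_args is a helper made up for this example.

```python
import json
import shlex


def frozen_components_args(names):
    """Return the CLI tokens that override training.frozen_components."""
    # json.dumps renders the Python list as a JSON list in a single string.
    return ["--training.frozen_components", json.dumps(list(names))]


# "config.cfg", "train2.spacy" and "dev2.spacy" are placeholder paths.
cmd = [
    "python", "-m", "spacy", "train", "config.cfg",
    "--paths.train", "train2.spacy",
    "--paths.dev", "dev2.spacy",
]
cmd += frozen_components_args(["ner", "textcat_multilabel"])

# shlex.join quotes the JSON list so a POSIX shell passes it as one argument.
print(shlex.join(cmd))
```

The important part is that the whole list reaches spaCy as one argument, which is why the unquoted form with a space after the comma gets split by the shell.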
