Transferring TextCat model to a different dataset. #12400
-
I am new to NLP, so I apologize if this question comes across as uninformed. I have trained a TextCat pipeline to recognize two classes on a large labeled dataset, and it has achieved good accuracy (~95%). I am now trying to transfer this model to classify the same two classes on a smaller dataset that comes from a different source. From my experience in machine learning, this is usually achieved by substituting/retraining the output layer of the model on the new dataset while freezing the lower layers. If my understanding is correct, how can I achieve this in spaCy? Or should I choose a different approach entirely?
Replies: 1 comment
-
Retraining the output layer is one of the possibilities that is worth trying. In spaCy, the hidden representations are made by the `tok2vec` component in a pipeline (or `transformer` if you are using transformers). You can freeze the `tok2vec` weights by adding it to the `frozen_components` option of the `[training]` section of your config:
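A minimal sketch of that config change, assuming your pipeline names the component `tok2vec` (the default in the textcat quickstart configs):

```ini
[training]
# Components listed here keep their weights fixed during training.
frozen_components = ["tok2vec"]
```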
With this change, `tok2vec` is not called during training, so you need to add `tok2vec` to the list of annotating components as well, so that the `textcat` pipe still gets hidden representations from `tok2vec`:
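Continuing the sketch under the same naming assumptions (note that `annotating_components` requires spaCy v3.1 or newer):

```ini
[training]
# Keep the shared embedding weights fixed...
frozen_components = ["tok2vec"]
# ...but still run tok2vec on each batch so that listeners
# such as textcat receive its representations as input.
annotating_components = ["tok2vec"]
```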
To reuse the `tok2vec` parameters you can source the component from your already-trained pipeline: https://spacy.io/usage/processing-pipelines#sourced-components

If you have enough unannotated text from the domain of the smaller dataset, you could also try pretraining, which may give you better hidden representations that are tailored to that domain: https://spacy.io/usage/embeddings-transformers#pretraining
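Putting the pieces together, here is a minimal sketch of what the config for the smaller dataset could look like; the path `./large_model/model-best` and the component names are assumptions for illustration, not taken from the thread:

```ini
# Initialize tok2vec from the pipeline trained on the large dataset.
[components.tok2vec]
source = "./large_model/model-best"

# Train a fresh textcat output layer on the new data.
[components.textcat]
factory = "textcat"

[training]
# Keep the reused tok2vec weights frozen, but still run the
# component so textcat gets its hidden representations.
frozen_components = ["tok2vec"]
annotating_components = ["tok2vec"]
```

You would then train as usual, e.g. `python -m spacy train config.cfg --output ./new_model` (paths assumed).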