Transferring TextCat model to a different dataset. #12400
-
I am new to NLP, so I apologize if this question comes across as uninformed. I have trained a TextCat pipeline to recognize two classes on a large labeled dataset, and it has achieved good accuracy (~95%). I am now trying to transfer this model to classify the same two classes on a smaller dataset that comes from a different source. From my experience in machine learning, this is usually achieved by substituting/retraining the output layer of the model on the new dataset while freezing the lower layers. If my understanding is correct, how can I achieve this in spaCy? Or should I choose a different approach entirely?
Replies: 1 comment
-
Retraining the output layer is one of the possibilities that is worth trying. In spaCy, the hidden representations are made by the `tok2vec` component in a pipeline (or `transformer` if you are using transformers). You can freeze the `tok2vec` weights by adding it to the `frozen_components` option of the `[training]` section of your config:
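A minimal sketch of that config change, assuming your pipeline names the component `tok2vec` (the default in the textcat quickstart configs):

```ini
[training]
# Components listed here keep their weights fixed during training.
frozen_components = ["tok2vec"]
```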
With this change, `tok2vec` is not called during training, so you need to add `tok2vec` to the list of annotating components as well, so that the `textcat` pipe still gets hidden representations from `tok2vec`:
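Continuing the sketch under the same naming assumptions (note that `annotating_components` requires spaCy v3.1 or newer):

```ini
[training]
# Keep the shared embedding weights fixed...
frozen_components = ["tok2vec"]
# ...but still run tok2vec on each batch so that listeners
# such as textcat receive its representations as input.
annotating_components = ["tok2vec"]
```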
To reuse the `tok2vec` parameters you can source the component from your already-trained pipeline: https://spacy.io/usage/processing-pipelines#sourced-components

If you have enough unannotated text from the domain of the smaller dataset, you could also try pretraining, which may give you better hidden representations that are tailored to that domain: https://spacy.io/usage/embeddings-transformers#pretraining
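Putting the pieces together, here is a minimal sketch of what the config for the smaller dataset could look like; the path `./large_model/model-best` and the component names are assumptions for illustration, not taken from the thread:

```ini
# Initialize tok2vec from the pipeline trained on the large dataset.
[components.tok2vec]
source = "./large_model/model-best"

# Train a fresh textcat output layer on the new data.
[components.textcat]
factory = "textcat"

[training]
# Keep the reused tok2vec weights frozen, but still run the
# component so textcat gets its hidden representations.
frozen_components = ["tok2vec"]
annotating_components = ["tok2vec"]
```

You would then train as usual, e.g. `python -m spacy train config.cfg --output ./new_model` (paths assumed).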