Improving multilabel textcat using POS and NER #10137
-
I am training a 4-label textcat using a transformer and in general I am getting excellent results. However, one of the categories is proving hard to differentiate from one of the others. I believe that if I tagged the words with POS and NER (I have done this with my own CNN-based code written in PyTorch) and then built the model, I would improve my results. Using spaCy training, I have set up the pipeline to include the parser, tagger, and ner, and frozen all three of them. But I am not sure how to annotate the training data to incorporate this information. Can we do this from the config.cfg file directly, or do I need to build my own custom code? Any pointers would be greatly appreciated.
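For context, freezing sourced components while training the textcat is configured in config.cfg roughly like this. This is only a sketch assuming a spaCy v3 pipeline with default component names; adjust the pipeline list to match your project:

```ini
[nlp]
pipeline = ["transformer","tagger","parser","ner","textcat_multilabel"]

[training]
# Frozen components are not updated during training,
# but they still need annotating_components to write
# their predictions onto the training Docs.
frozen_components = ["tagger","parser","ner"]
```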
Replies: 2 comments
-
Hello,
you can add your components (parser, tagger, ner) to the annotating_components variable in the config.cfg (https://spacy.io/api/data-formats#config) so that they set their annotations during training. In order to have an effect on the performance you have to use a tok2vec instead of a transformer and add the new features to the attrs variable under [components.textcat.model.tok2vec.embed]. However, I'm unsure if this will have a big impact on your textcat.
Alternatively, I suggest that you have a look at your dataset and figure out if your "problematic label" is equally represented in the training/development set.
-
Thank you for your points! Unfortunately, I don't have a good balance for the problematic label's training texts (there are a lot fewer of them), as they just don't occur in the wild as much. I was thinking about generating a set of synthetic data to even it out somewhat. Thank you for the information.
Michael R. Wade
Noetic Analytics LLC
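As a quick way to check the balance discussed above, you can count how often each label is positive in the dataset. A minimal sketch in plain Python; the dict-per-example format mirrors spaCy's `doc.cats`, and all names here are illustrative:

```python
from collections import Counter

def label_distribution(examples):
    """Count positive labels in a multilabel dataset.

    `examples` is a list of dicts mapping label -> score,
    as in spaCy's `doc.cats`; scores >= 0.5 count as positive.
    """
    counts = Counter()
    for cats in examples:
        for label, value in cats.items():
            if value >= 0.5:
                counts[label] += 1
    return counts

train = [{"A": 1, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 0}]
print(label_distribution(train))  # Counter({'A': 3, 'B': 2})
```

If one label's count is far below the others, oversampling or synthetic examples for that label are common mitigations.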
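The configuration change suggested in this thread (listing the annotating components and adding the new features to the tok2vec embed layer) might look roughly like this in config.cfg. This is a sketch, not a tested setup; the specific attrs, rows, and width values are assumptions to adapt to your pipeline:

```ini
[training]
# These components run during training and write POS tags
# and entity types onto the Docs the textcat trains on.
annotating_components = ["tagger","parser","ner"]

[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
# POS and ENT_TYPE are only available because the
# annotating components above set them during training.
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","POS","ENT_TYPE"]
rows = [5000,2500,2500,2500,500,500]
include_static_vectors = false
```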