Improving multilabel textcat using POS and NER #10137
-
I am training a 4-label textcat using a transformer and in general I am getting excellent results. However, one of the categories is proving hard to differentiate from one of the others. I believe that if I tagged the words with POS and NER (I have done this with my own CNN-based code written in PyTorch) and then built the model, I would improve my results. Using spaCy training, I have set up the pipeline to include the parser, tagger, and ner, and frozen all three of them. But I am not sure how to annotate the training data to incorporate this information. Can we do this from the config.cfg file directly, or do I need to build my own custom code? Any pointers would be greatly appreciated.
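For context, freezing sourced components while training the textcat is configured in config.cfg roughly like this. This is only a sketch assuming a spaCy v3 pipeline with default component names; adjust the pipeline list to match your project:

```ini
[nlp]
pipeline = ["transformer","tagger","parser","ner","textcat_multilabel"]

[training]
# Frozen components are not updated during training,
# but they still need annotating_components to write
# their predictions onto the training Docs.
frozen_components = ["tagger","parser","ner"]
```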
Replies: 2 comments
-
Hello,
you can add your components (parser, tagger, ner) to the annotating_components variable in the config.cfg (https://spacy.io/api/data-formats#config) so that they set their annotations during training. In order to have an effect on the performance you have to use a tok2vec instead of a transformer and add the new features to the attrs variable under [components.textcat.model.tok2vec.embed]. However, I'm unsure if this will have a big impact on your textcat.
Alternatively, I suggest that you have a look at your dataset and figure out if your "problematic label" is equally represented in the training/development set.
-
Thank you for your points! Unfortunately, I don't have a good balance for the problematic label's training texts (there are a lot fewer of them), as they just don't occur in the wild as much. I was thinking about generating a set of synthetic data to even it out somewhat. Thank you for the information.
Michael R. Wade
Noetic Analytics LLC
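As a quick way to check the balance discussed above, you can count how often each label is positive in the dataset. A minimal sketch in plain Python; the dict-per-example format mirrors spaCy's `doc.cats`, and all names here are illustrative:

```python
from collections import Counter

def label_distribution(examples):
    """Count positive labels in a multilabel dataset.

    `examples` is a list of dicts mapping label -> score,
    as in spaCy's `doc.cats`; scores >= 0.5 count as positive.
    """
    counts = Counter()
    for cats in examples:
        for label, value in cats.items():
            if value >= 0.5:
                counts[label] += 1
    return counts

train = [{"A": 1, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 0}]
print(label_distribution(train))  # Counter({'A': 3, 'B': 2})
```

If one label's count is far below the others, oversampling or synthetic examples for that label are common mitigations.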
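The configuration change suggested in this thread (listing the annotating components and adding the new features to the tok2vec embed layer) might look roughly like this in config.cfg. This is a sketch, not a tested setup; the specific attrs, rows, and width values are assumptions to adapt to your pipeline:

```ini
[training]
# These components run during training and write POS tags
# and entity types onto the Docs the textcat trains on.
annotating_components = ["tagger","parser","ner"]

[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
# POS and ENT_TYPE are only available because the
# annotating components above set them during training.
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","POS","ENT_TYPE"]
rows = [5000,2500,2500,2500,500,500]
include_static_vectors = false
```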