Training a TextCat with different weights between FP and FN. #11256

Jarathael · 2022-08-02T15:58:55Z

Hi everyone!
I'm currently working on a classification project at my job using NLP, and I want to try a very basic model using spaCy's TextCat.

I used the quick configuration set up from here https://spacy.io/usage/training#quickstart with English / textcat / CPU / accuracy.

My first problem is as follows. In the base configuration generated by the tool, the Tok2Vec component seems to be created from scratch, which is not what I want: I would prefer to use the already trained tok2vec from en_core_web_lg.
So I added:

[paths]
init_tok2vec = "en_core_web_lg"

[initialize]
init_tok2vec = ${paths.init_tok2vec}

I'm just not sure this is the right way to do it.

The second thing I don't know how to change is what TextCat is optimized for. In my classification context I want to be more careful about avoiding FPs, so I would like to weigh precision more than recall by using an F-beta score instead of the F1 score.
Is that something we can configure in some way? Or is there a way to optimize the classification threshold on that basis without having to write the optimization loop myself?

If you need any more details, I'll be happy to give them to you.
Thank you in advance for your help :)

Answered by adrianeboyd

adrianeboyd · 2022-08-03T06:41:41Z

What you should probably start with:

Use "accuracy" in the quickstart to use the static word vectors from en_core_web_lg, with no further changes to the config.
You do want a separate tok2vec for a textcat component. Don't use the tok2vec from en_core_web_lg.

After training, you could try out spacy report (https://spacy.io/universe/project/spacy-report) to experiment with the threshold. The threshold is only used for scoring; it doesn't affect the training process itself or the annotations saved to doc.cats, which always contain the scores for all categories.

There's a bit of confusing duplication in the settings, so if you want to change the threshold after training and then use spacy evaluate with the built-in scoring, you need to modify nlp.get_pipe("textcat").cfg["threshold"].
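
For illustration, a minimal sketch of changing that setting in Python. A blank pipeline with an untrained textcat stands in for a trained one here so the snippet is self-contained; the commented-out path and the value 0.7 are placeholders, not recommendations.

```python
import spacy

# Stand-in for a trained pipeline. In practice you would load your own,
# e.g. nlp = spacy.load("training/model-best")  (hypothetical path).
nlp = spacy.blank("en")
nlp.add_pipe("textcat")

# The setting named above: the threshold stored on the component itself.
textcat = nlp.get_pipe("textcat")
textcat.cfg["threshold"] = 0.7  # example value only

print(textcat.cfg["threshold"])
```

To use the new value with spacy evaluate on the command line, the modified pipeline would presumably have to be saved first (for example with nlp.to_disk); that step is an assumption and is not shown here.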

4 replies

Jarathael Aug 3, 2022
Author

Thank you for the answer!

I don't understand why I would want a separate tok2vec for a textcat. I understand that if I had other components in my spaCy pipeline, I would need a separate tok2vec when training it along with textcat, to avoid degrading the performance of the other components. For now I just have the textcat component.
Isn't it better to use a pretrained tok2vec?

So for the threshold, I just have to write my own function to find the threshold that gives the highest F-beta score possible! It's all good then :)
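
A minimal sketch of what such a function could look like, in plain Python. It assumes a binary setup where each document has one score for the positive label (for instance taken from doc.cats) and a gold True/False label, and that a score at or above the threshold counts as a positive prediction. The names fbeta and best_threshold, and the grid of candidate thresholds, are made up for this example and are not part of spaCy.

```python
from typing import List, Optional, Sequence, Tuple


def fbeta(tp: int, fp: int, fn: int, beta: float) -> float:
    """F-beta from raw counts. beta < 1 weighs precision more than recall."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(
    scores: Sequence[float],
    gold: Sequence[bool],
    beta: float = 0.5,
    candidates: Optional[List[float]] = None,
) -> Tuple[float, float]:
    """Return (threshold, F-beta) for the candidate with the highest F-beta.

    Ties are resolved in favour of the lowest threshold.
    """
    if candidates is None:
        # Hypothetical grid: 0.01, 0.02, ..., 0.99
        candidates = [i / 100 for i in range(1, 100)]
    best_t, best_f = candidates[0], -1.0
    for t in candidates:
        # Assumption: score >= threshold means "predicted positive".
        tp = sum(1 for s, g in zip(scores, gold) if s >= t and g)
        fp = sum(1 for s, g in zip(scores, gold) if s >= t and not g)
        fn = sum(1 for s, g in zip(scores, gold) if s < t and g)
        f = fbeta(tp, fp, fn, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Toy data. With a trained pipeline the scores could come from something like
# [doc.cats["POSITIVE"] for doc in nlp.pipe(texts)]  ("POSITIVE" is a made-up label).
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
gold = [True, True, False, True, False]

threshold, score = best_threshold(scores, gold, beta=0.5)
print(threshold, round(score, 3))
```

With beta=0.5 the search prefers a threshold that drops the false positive at 0.6 even though it loses the true positive at 0.4; with beta=2 it would favour a lower threshold instead. The threshold should be tuned on development data, not on the final test set.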

adrianeboyd Aug 3, 2022

With one "listening" component it doesn't really make a difference whether you have a separate tok2vec (similar to tagger in en_core_web_lg) or an internal tok2vec (similar to ner in en_core_web_lg). The quickstart will only generate the option with the separate tok2vec, but you can manually edit the config before training or use nlp.replace_listeners to convert it to an internal tok2vec after training.

It's useful to use the static word vectors from en_core_web_lg, but the pretrained tok2vec from en_core_web_lg has been fine-tuned for tagging/parsing and in our experience isn't really useful for textcat vs. starting from a new one.

Jarathael Aug 3, 2022
Author

I understand! Thank you for your explanation.
Maybe one last question: what are the static word vectors in en_core_web_lg? And should I activate them?

adrianeboyd Aug 4, 2022

Choosing "accuracy" in the quickstart should already configure the static vectors correctly.
