textcat works or works better without balancing classes? #10089

astrouhiu · 2022-01-19T04:53:24Z


Hello. I would like to ask three questions about "textcat":

Does spaCy handle unbalanced classes?
Does it need pre-balanced classes, or does it work better with them?

I trained two "textcat" models: 1) using all the examples, and 2) undersampling by randomly removing examples from the majority classes.

With all the data, the target column has 3 categories distributed as follows: category 'A' has 3368 examples, category 'B' has 647 and category 'C' has 91, totaling 4106 examples.
After undersampling, it has: category 'A' with 100 examples, category 'B' with 100 examples and category 'C' with 91 examples, totaling 291 examples.
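
For reference, the undersampling in case 2) amounts to something like the sketch below. It is only an illustration: the `undersample` helper, the cap of 100 and the fixed seed are assumptions, and the texts are placeholders rather than the real data.

```python
import random
from collections import Counter, defaultdict


def undersample(examples, cap, seed=0):
    """Randomly keep at most `cap` examples per label (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    kept = []
    for label, items in by_label.items():
        # Classes at or below the cap are kept whole; larger ones are sampled.
        kept.extend(items if len(items) <= cap else rng.sample(items, cap))
    rng.shuffle(kept)
    return kept


# Placeholder data with the class sizes from the post.
sizes = {"A": 3368, "B": 647, "C": 91}
data = [(f"text {label}-{i}", label) for label, n in sizes.items() for i in range(n)]

balanced = undersample(data, cap=100)
counts = Counter(label for _, label in balanced)
print(len(data), len(balanced), dict(sorted(counts.items())))
```

Category 'C' is below the cap, so all 91 of its examples are kept, which is why the classes end up close to balanced rather than exactly equal.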

I then split the data 70% for training and 30% for testing. Out of curiosity, I compared the evaluation metrics in the "meta.json" file with those given by sklearn's "classification_report" function (using spaCy's predictions on y_test). In case 1) the results are identical; in case 2) they are different.
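
A minimal sketch of that kind of comparison, assuming each prediction is taken as the highest-scoring label in `doc.cats` (the post does not say how the predictions were derived). The score dictionaries are made-up stand-ins for `doc.cats`, and `top_label` and `per_class_prf` are hypothetical helpers that compute the per-class precision, recall and F1 by hand, so the numbers can be checked against either report.

```python
from collections import Counter


def top_label(cats):
    """Pick the highest-scoring label from a doc.cats-style dict."""
    return max(cats, key=cats.get)


def per_class_prf(y_true, y_pred):
    """Per-class precision, recall and F1 (hypothetical helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1
            fn[gold] += 1
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        report[label] = {"precision": p, "recall": r, "f1": f}
    return report


# Made-up scores standing in for doc.cats on four test texts.
predicted_cats = [
    {"A": 0.7, "B": 0.2, "C": 0.1},
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.2, "B": 0.5, "C": 0.3},
    {"A": 0.5, "B": 0.1, "C": 0.4},
]
y_test = ["A", "B", "B", "C"]

y_pred = [top_label(cats) for cats in predicted_cats]
report = per_class_prf(y_test, y_pred)
# Macro F1 is the unweighted mean of the per-class F1 values.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)
print(y_pred, report, macro_f1)
```

Laying the per-class numbers side by side like this makes it easier to see which class, and which metric, two reports disagree on.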

My third question: is this expected? Why are the results different in the case of undersampling?

Thanks a lot!

Answered by ljvmiranda921

ljvmiranda921 · 2022-01-19T05:21:01Z


Hi @astrouhiu ,

There are no built-in or special functions for automatically handling imbalanced data, although data augmentation can help improve your results.
Generally, it's better if you can pre-balance your classes. You might also want to try augmenting your data instead of discarding useful samples.
For the last question, I'm curious: how different are the results? How large is the discrepancy?
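
On the second point, one simple way to pre-balance without discarding examples is random oversampling, which duplicates minority-class examples until every class matches the largest one. The sketch below shows that idea only; it is plain oversampling rather than data augmentation proper, and the `oversample` helper, the seed and the placeholder texts are all assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def oversample(examples, seed=0):
    """Duplicate minority-class examples up to the largest class size (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Draw the shortfall with replacement from the same class.
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced


# Placeholder data with the class sizes from the question.
sizes = {"A": 3368, "B": 647, "C": 91}
data = [(f"text {label}-{i}", label) for label, n in sizes.items() for i in range(n)]

balanced = oversample(data)
counts = Counter(label for _, label in balanced)
print(len(balanced), dict(sorted(counts.items())))
```

If you try something like this, it is usually safer to apply it to the training split only, after the train/test split, so that duplicated examples do not end up in the test set.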
