Text Classification Model over hundreds of labels #10204

LeoAlvesRodrigues · 2022-02-03T15:53:34Z

Hello, I am working on a project in which we are trying to build a model that classifies text into roughly 700 labels. We are starting small with only a few labels, and I would like to know how useful other pipeline components are for a project of this type and size. We are also wondering whether creating several models with only a few labels each is better than creating a single model that covers all labels.

Any information will be appreciated,
Thank you in advance.

Have a good day!

pmbaumgartner · 2022-02-11T00:48:22Z

Hey @LeoAlvesRodrigues, thanks for the question.

The textcat component has no arbitrary limit on the number of labels, but memory use scales with the number of labels, so splitting them across several components might make sense if you have unrelated groups of categories.

We have plans to make a hierarchical textcat component, which might be suitable depending on how your categories are structured, but we are not working on it now.

Without knowing more about your specific data, I would say prioritize labels that you have the most examples for. If you could give us more information about the data and the problem you're trying to solve, we could give some more specific recommendations.
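As a rough illustration of prioritizing the best-covered labels, here is a minimal sketch in plain Python. It assumes the training data is available as `(text, label)` pairs; the helper names and the `TITLE_*` labels are hypothetical and not part of spaCy.

```python
from collections import Counter


def most_frequent_labels(pairs, top_n):
    """Return the top_n labels, ordered by how many examples each one has."""
    counts = Counter(label for _text, label in pairs)
    return [label for label, _count in counts.most_common(top_n)]


def keep_labels(pairs, labels):
    """Keep only the examples whose label is in the selected subset."""
    wanted = set(labels)
    return [(text, label) for text, label in pairs if label in wanted]


# Hypothetical data: TITLE_A has 3 examples, TITLE_B has 2, TITLE_C has 1.
pairs = [
    ("text 1", "TITLE_A"),
    ("text 2", "TITLE_A"),
    ("text 3", "TITLE_A"),
    ("text 4", "TITLE_B"),
    ("text 5", "TITLE_B"),
    ("text 6", "TITLE_C"),
]

selected = most_frequent_labels(pairs, top_n=2)
subset = keep_labels(pairs, selected)
```

Starting from a subset like this makes it easier to see how accuracy changes as more, and rarer, labels are added.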

3 replies

LeoAlvesRodrigues Feb 15, 2022
Author

Hello @pmbaumgartner! First and foremost, thank you so much for reaching out.

I'm going to try to explain what we have and what we want to accomplish as best I can.

So we have a really big database that stores forms. These forms are filled in and submitted by users. Sometimes a user enters the wrong title/label. When that happens, searching for that specific form becomes almost impossible, and it becomes harder for us to access it. Our forms have around 750 different titles (some of them are quite similar, which will make training harder, but we will find a way to deal with that).

The forms consist of the following:
Title/Label : X
Information1 : good information
Information2 : empty information
Information3 : empty information

When I say good information, I mean information/sentences that tell us right away which title/label the form belongs to, whereas empty information doesn't really connect to anything. These are expressions like: "Form read at day xx-xx-xxxx". So we created a title/label in training for the empty expressions.

spaCy will read these forms like this:
good information, correct title
empty information, empty title

We are also generating data. For example, for one specific title we only had around 500 sentences/expressions, which isn't much, so we found a way to multiply this data using synonyms to generate a few thousand sentences/expressions. I want to point out that we only generate data that we are sure our forms contain, so we are not generating random data that will never be seen.
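The synonym-based multiplication described here could look roughly like the sketch below. This is not the exact method used in the thread: the function name, the synonym table, and the example sentence are all hypothetical, and the splitting is whitespace-only, so punctuation attached to a word is not handled.

```python
from itertools import product


def expand_with_synonyms(sentence, synonyms):
    """Yield every variant of `sentence` obtained by swapping words for synonyms."""
    options = []
    for word in sentence.split():
        # Each position can keep the original word or use any listed synonym.
        options.append([word] + synonyms.get(word.lower(), []))
    for combo in product(*options):
        yield " ".join(combo)


# Hypothetical synonym table, restricted to wording known to occur in the forms.
synonyms = {
    "request": ["application"],
    "form": ["document", "sheet"],
}

variants = list(expand_with_synonyms("request form received", synonyms))
```

The first variant is the original sentence, followed by every combination of substitutions. One thing to watch with this kind of augmentation is that variants of the same source sentence should stay on the same side of the train/evaluation split; otherwise the evaluation set contains near-copies of training data and accuracy will look better than it is.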

So far this has been working out pretty well, giving us really good accuracy, but we don't know up to what point it will keep working.

Our final objective is to have one or several models (we're still not sure which is better) that will go through all our forms, categorize them, and tell us whether they have the wrong title.
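A minimal sketch of that last step, checking a form's recorded title against the model's prediction, might look like this. It assumes the classifier's output is a dict of label to score (the shape of spaCy's `doc.cats`); the function name, the `EMPTY` label name, and the 0.8 threshold are assumptions chosen for illustration.

```python
def flag_mismatch(recorded_title, scores, threshold=0.8, ignore=("EMPTY",)):
    """Return the suggested title if the model confidently disagrees, else None.

    `scores` maps each label to a score, like spaCy's `doc.cats`.
    """
    predicted, confidence = max(scores.items(), key=lambda item: item[1])
    if predicted in ignore:
        # Text classified as empty information says nothing about the title.
        return None
    if predicted != recorded_title and confidence >= threshold:
        return predicted
    return None


# Hypothetical scores for a form that was stored under TITLE_A.
scores = {"TITLE_A": 0.05, "TITLE_B": 0.90, "EMPTY": 0.05}
suggestion = flag_mismatch("TITLE_A", scores)
```

Only flagging confident disagreements keeps the number of forms to review manageable; the threshold would need tuning on forms whose correct title is known.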

Thank you for taking the time to read this, and I apologise if I couldn't explain it very well.
If you have any questions about this, I will be more than happy to answer.

pmbaumgartner Feb 16, 2022

Thanks for taking the time to explain a bit further. I have a few more questions, but it sounds like what you have is working well for now.

How are you structuring the data across all your labels? For example, is "good information, correct title" a binary classification or a multilabel classification across all titles?
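To make the distinction concrete: spaCy's `textcat` component treats labels as mutually exclusive, while `textcat_multilabel` allows any number of labels per document, and in both cases the gold annotation is a dict of label to value under `cats`. The sketch below only builds those dicts in plain Python; the helper names and label names are hypothetical.

```python
LABELS = ["TITLE_A", "TITLE_B", "EMPTY"]  # hypothetical label set


def exclusive_cats(gold_label, labels):
    """Exactly one correct label per text: a one-hot dict over all labels."""
    return {label: 1.0 if label == gold_label else 0.0 for label in labels}


def multilabel_cats(gold_labels, labels):
    """Any number of correct labels per text."""
    gold = set(gold_labels)
    return {label: 1.0 if label in gold else 0.0 for label in labels}


single = exclusive_cats("TITLE_B", LABELS)
multi = multilabel_cats(["TITLE_A", "TITLE_B"], LABELS)
```

A binary setup is the special case of the exclusive framing with only two labels.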

I'm not sure you need to add the empty information to your classifier. It should be learning to discriminate helpful information on its own.

Finally, it seems like this might be a data validation problem. Could you edit the form so that users are less likely to input bad information, and possibly add some validation logic? Then you wouldn't have to rely on an imprecise ML system to do the correction for you.

LeoAlvesRodrigues Feb 16, 2022
Author

Hello again!

I forgot to mention that it was a binary classification, my bad. We did try multilabel classification as well, but it didn't make much sense for our problem.

We added the empty title because a lot of the time there are sentences/expressions that don't lead to any title/label, so we decided to add it to make sure the model also learns the empty sentences/expressions.

In a way, yes, it is a validation problem. In my opinion there are too many unnecessary labels, but I don't think we can change that now. We were also thinking about using spaCy to create a smart labeling technique where, after the user enters information in the form, it suggests what the label should be. As I mentioned before, there are too many forms with wrong labels/titles, so we're hoping that in the future, once we have everything trained, it will help us label everything correctly.
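The label-suggestion idea could be sketched as follows, assuming the trained classifier returns a dict of label to score for the text the user entered (the shape of spaCy's `doc.cats`). The function name, the cutoff values, and the scores are hypothetical.

```python
import heapq


def suggest_titles(scores, k=3, min_score=0.1):
    """Return up to k (label, score) pairs, best first, dropping weak candidates."""
    best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return [(label, score) for label, score in best if score >= min_score]


# Hypothetical scores for the text a user has just typed into a form.
scores = {"TITLE_A": 0.60, "TITLE_B": 0.30, "TITLE_C": 0.07, "TITLE_D": 0.03}
suggestions = suggest_titles(scores)
```

Showing a short ranked list rather than a single answer leaves the final choice with the user, which matters when many of the titles are similar.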

Once again, thank you for taking the time to help me with this.
