Assign labels/create training data for a large dataset #13786
EY4L started this conversation in Help: Best practices · Replies: 0 comments
What is the most efficient way to create training data for a text classifier when you have a very large dataset with many classes, for example 800k rows and 1.6k classes?
Currently my approach is to build a category dictionary with all classes, e.g. {'cat1': 0, ..., 'cat2': 1}, where the assigned class gets the 1, then run doc = nlp(text) on each example and assign that dictionary to doc.cats.
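A minimal sketch of that per-example approach, assuming a blank English pipeline; raw_data and labels are hypothetical stand-ins for the real dataset:

```python
import spacy

nlp = spacy.blank("en")  # assumption: a blank English pipeline

# Hypothetical stand-ins for the real 800k rows / 1.6k classes
raw_data = [("This is a sports article.", "sports")]
labels = ["sports", "politics", "technology"]

docs = []
for text, label in raw_data:
    # Full category dictionary: 1 for the assigned class, 0 for all others
    cats = {cat: 1 if cat == label else 0 for cat in labels}
    doc = nlp(text)  # one nlp() call per row
    doc.cats = cats
    docs.append(doc)
```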
I know there are the following speed improvements I can try, but are there any other approaches?
- using nlp.pipe and passing a batch of texts (a sketch follows the code examples below)
- the doc.cats dict only needs the class with a 1 value:
Current method:

```python
training_data = [
    ("This is a sports article.", {"cats": {"sports": 1, "politics": 0, "technology": 0}}),
    ("This is a political article.", {"cats": {"sports": 0, "politics": 1, "technology": 0}}),
]
```

Faster approach:

```python
training_data = [
    ("This is a sports article.", {"cats": {"sports": 1}}),
    ("This is a political article.", {"cats": {"politics": 1}}),
]
```