Assign labels/create training data for a large dataset #13786
EY4L started this conversation in Help: Best practices · Replies: 0 comments
What is the most efficient way to create training data for a text classifier when you have a very large dataset with many classes, for example 800k rows and 1.6k classes?
Currently my approach is to build a category dictionary with all classes, e.g. {'cat1': 0, ..., 'cat2': 1}, where the assigned class gets the 1, then run doc = nlp(text) on each example and assign that dictionary to doc.cats.
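A minimal sketch of that per-example approach, assuming a blank English pipeline; raw_data and labels are hypothetical stand-ins for the real dataset:

```python
import spacy

nlp = spacy.blank("en")  # assumption: a blank English pipeline

# Hypothetical stand-ins for the real 800k rows / 1.6k classes
raw_data = [("This is a sports article.", "sports")]
labels = ["sports", "politics", "technology"]

docs = []
for text, label in raw_data:
    # Full category dictionary: 1 for the assigned class, 0 for all others
    cats = {cat: 1 if cat == label else 0 for cat in labels}
    doc = nlp(text)  # one nlp() call per row
    doc.cats = cats
    docs.append(doc)
```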
I know there are the following speed improvements I can try, but are there any other approaches?
- using nlp.pipe and passing a batch of texts (a sketch follows the code examples below)
- the doc.cats dict only needs the class with a 1 value:
Current method:

```python
training_data = [
    ("This is a sports article.", {"cats": {"sports": 1, "politics": 0, "technology": 0}}),
    ("This is a political article.", {"cats": {"sports": 0, "politics": 1, "technology": 0}}),
]
```

Faster approach:

```python
training_data = [
    ("This is a sports article.", {"cats": {"sports": 1}}),
    ("This is a political article.", {"cats": {"politics": 1}}),
]
```