Large Dataset vs Small Dataset #13007
blizaga started this conversation in Help: Best practices
-
When training models, it's preferable to use as much training data as possible, but diversity must also be taken into account. A large dataset that only contains examples from a small subset of labels/categories will perform relatively poorly compared to a smaller dataset with a more diverse set of examples. The important consideration when training any model is to ensure that the training data is representative and diverse enough for the model to generalize over a wide range of inputs.
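For example, you can check how balanced the label distribution in your textcat training corpus is before training. A minimal sketch, assuming the annotations are stored in a serialized `DocBin` file (the path `train.spacy` is only an illustration) with the categories in `doc.cats`:

```python
from collections import Counter

import spacy
from spacy.tokens import DocBin

# Load the serialized training corpus (the path is an assumption for illustration).
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./train.spacy")

# Count how often each textcat label is the dominant category per example.
label_counts = Counter()
for doc in doc_bin.get_docs(nlp.vocab):
    if doc.cats:
        top_label = max(doc.cats, key=doc.cats.get)
        label_counts[top_label] += 1

total = sum(label_counts.values())
for label, count in label_counts.most_common():
    print(f"{label}: {count} examples ({count / total:.1%})")
```

If one label dominates the counts, adding more examples of that same label will not help generalization as much as adding examples of the under-represented ones.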
-
I would like to ask about a textcat model with a transformer tokenizer: is it best practice to train it with a large dataset, or with a dataset that is not too large?
I ask because when I ran a benchmark with spaCy's default commands, the score for the large dataset was not as high as when using a smaller dataset.