Pretraining Tok2Vec - Optimal Number of Samples #8471
So as the phrase "usually be worth trying" suggests, it's really hard to say what the right number of samples is. Usually more is better, if you have the compute time to train on them.
Yes, the expectation is that you have more unlabeled text than labeled text. You could still try it with 5k documents, though.
If you don't have enough data, your embeddings will be very random/spiky and you'll get bad results. The one subtle point is that you could run into a scenario where your data is consistent at training time, so the weird embeddings aren't a problem, but real data you bring in later has some semantic drift or other changes and performance suffers. That's the case with any model, but it can be exaggerated when you're working with less data.
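If it helps, here's a rough sketch of what the workflow looks like end to end. The paths, the `en`/`ner` pipeline choice, and the `model9.bin` epoch file below are just placeholders, and the last step assumes your training config maps `paths.init_tok2vec` into the `[initialize]` block, which I believe the generated configs already do:

```bash
# 1) Generate a config that includes a [pretraining] block
#    (language and pipeline here are placeholders)
python -m spacy init config config.cfg --lang en --pipeline ner --pretraining

# 2) Pretrain on the raw, unlabeled text; one weights file per epoch
#    (model0.bin, model1.bin, ...) is written to the output directory
python -m spacy pretrain config.cfg ./pretrain_output --paths.raw_text ./raw_text.jsonl

# 3) Train as usual, initializing the tok2vec layer from one of the
#    pretrained epoch files via init_tok2vec
python -m spacy train config.cfg --output ./training_output \
  --paths.train ./train.spacy --paths.dev ./dev.spacy \
  --paths.init_tok2vec ./pretrain_output/model9.bin
```

The main constraint is that the tok2vec settings used for pretraining have to match the ones used for training, which is why reusing the same config for both steps is the easiest route.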
Is there an optimal number of samples for `--pretraining` tok2vec? As stated in the docs:
I assume the above is the scenario that would benefit from adding a pretrained tok2vec. However, to perform pretraining I would use a significantly larger amount (>>5k) of unlabeled raw text, similar to how I would train my own word embeddings (static vectors). Is this correct?
Are there any particular tradeoffs or issues I need to be aware of if the sample used for pretraining is too small?
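For reference, the raw text I'd pass in would be a JSONL file roughly like the sketch below, one document per line with a `text` key (the contents are made up), assuming the default `spacy.JsonlCorpus.v1` reader in `[corpora.pretrain]`:

```jsonl
{"text": "First unlabeled document from my domain."}
{"text": "Second unlabeled document, and so on for many more lines."}
```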