Pretraining Tok2Vec - Optimal Number of Samples #8471
So as the phrase "usually be worth trying" suggests, it's really hard to say what the right number of samples is. Usually more is better, if you have the compute time to train on them.
Yes, the expectation is that you have more unlabeled text than labeled text. You could still try it with 5k documents, though.
If you don't have enough data, your embeddings will be very random/spiky and you'll get bad results. The one subtle point is that you could run into a scenario where your data is consistent at training time, so the weird embeddings aren't a problem, but real data you bring in later has some semantic drift or other changes and performance suffers. That's the case with any model, but it can be exaggerated when you're working with less data.
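If it helps, here's a rough sketch of what the workflow looks like end to end. The paths, the `en`/`ner` pipeline choice, and the `model9.bin` epoch file below are just placeholders, and the last step assumes your training config maps `paths.init_tok2vec` into the `[initialize]` block, which I believe the generated configs already do:

```bash
# 1) Generate a config that includes a [pretraining] block
#    (language and pipeline here are placeholders)
python -m spacy init config config.cfg --lang en --pipeline ner --pretraining

# 2) Pretrain on the raw, unlabeled text; one weights file per epoch
#    (model0.bin, model1.bin, ...) is written to the output directory
python -m spacy pretrain config.cfg ./pretrain_output --paths.raw_text ./raw_text.jsonl

# 3) Train as usual, initializing the tok2vec layer from one of the
#    pretrained epoch files via init_tok2vec
python -m spacy train config.cfg --output ./training_output \
  --paths.train ./train.spacy --paths.dev ./dev.spacy \
  --paths.init_tok2vec ./pretrain_output/model9.bin
```

The main constraint is that the tok2vec settings used for pretraining have to match the ones used for training, which is why reusing the same config for both steps is the easiest route.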
Is there an optimal number of samples for `--pretraining` tok2vec? As stated in the docs:
I assume the above is the scenario that would benefit from adding a pretrained tok2vec. However, to perform pretraining I would use a significantly larger amount (>>5k) of unlabeled raw text, similar to how I would train my own word embeddings (static vectors). Is this correct?
Are there any particular tradeoffs or issues I need to be aware of if the sample used for pretraining is too small?
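For reference, the raw text I'd pass in would be a JSONL file roughly like the sketch below, one document per line with a `text` key (the contents are made up), assuming the default `spacy.JsonlCorpus.v1` reader in `[corpora.pretrain]`:

```jsonl
{"text": "First unlabeled document from my domain."}
{"text": "Second unlabeled document, and so on for many more lines."}
```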